CN112560425A - Template generation method and device, electronic equipment and storage medium - Google Patents

Publication number
CN112560425A
Authority
CN
China
Prior art keywords
candidate
historical search
template
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011556696.8A
Other languages
Chinese (zh)
Other versions
CN112560425B (en)
Inventor
潘秋桐
李瑞高
李雅楠
何伯磊
刘准
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011556696.8A priority Critical patent/CN112560425B/en
Publication of CN112560425A publication Critical patent/CN112560425A/en
Application granted granted Critical
Publication of CN112560425B publication Critical patent/CN112560425B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The disclosure provides a template generation method, a template generation apparatus, an electronic device, a storage medium, and a computer program product, relating to fields such as intelligent search and intelligent recommendation. The specific implementation scheme is as follows: acquire M historical search texts and the click resources respectively corresponding to the M historical search texts, where M is an integer greater than or equal to 1; cluster the M historical search texts based on the relevant information of their respectively corresponding click resources to obtain N sample sets, where N is an integer greater than or equal to 1; and, based on the plurality of historical search texts contained in each of the N sample sets, determine the target template related to each sample set and the similar words contained in the word slots of the target template.

Description

Template generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to the fields of intelligent search and intelligent recommendation, among others.
Background
In the search field, statistical analysis of users' search texts shows that when a user has a specific intention toward certain information or resources, the search texts used tend to conform to certain patterns. Texts sharing the same pattern can be summarized into a template, and the template can then be used to recognize the user's intention with relative ease. However, generating such templates efficiently and accurately remains a problem to be solved.
Disclosure of Invention
The disclosure provides a template generation method, a template generation device, an electronic device, a storage medium and a computer program product.
According to a first aspect of the present application, there is provided a template generation method, including:
acquiring M historical search texts and click resources corresponding to the M historical search texts respectively; m is an integer greater than or equal to 1;
clustering the M historical search texts based on the relevant information of the clicked resources respectively corresponding to the M historical search texts to obtain N sample sets; n is an integer greater than or equal to 1;
and determining target templates respectively related to the N sample sets and similar words contained in word slots of the target templates based on a plurality of historical search texts respectively contained in the N sample sets.
According to a second aspect of the present application, there is provided a template generating apparatus comprising:
the information acquisition module is used for acquiring M historical search texts and click resources corresponding to the M historical search texts respectively; m is an integer greater than or equal to 1;
the clustering module is used for clustering the M historical search texts based on the relevant information of the click resources respectively corresponding to the M historical search texts to obtain N sample sets; n is an integer greater than or equal to 1;
and the generating module is used for determining target templates respectively related to the N sample sets and similar words contained in word slots of the target templates based on a plurality of historical search texts respectively contained in the N sample sets.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method.
According to a fourth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the above method.
According to a fifth aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described method.
By adopting this technical scheme, sample sets can be generated from historical search texts that have click resources, and the corresponding target template and the similar words contained in its word slots can be determined from the historical search texts in each sample set. The target template and the similar words in its word slots are thus obtained by automatically analyzing historical search samples and their corresponding click resources, which improves the accuracy and efficiency of template generation and allows the template tree and word slot tree used in subsequent prediction to be constructed or updated more accurately and efficiently.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of a template generation method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a process for adding new words to a segmentation dictionary according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the composition of a word slot tree, according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the composition of a template tree according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of an alternating process based on a word slot tree and a template tree, according to an embodiment of the present disclosure;
FIG. 6 is a first schematic structural diagram of a template generating apparatus according to an embodiment of the present disclosure;
FIG. 7 is a second schematic structural diagram of a template generating apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device for implementing a template generation method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
An embodiment of the present disclosure provides a template generating method, as shown in fig. 1, including:
s101: acquiring M historical search texts and click resources corresponding to the M historical search texts respectively; m is an integer greater than or equal to 1;
s102: clustering the M historical search texts based on the relevant information of the clicked resources respectively corresponding to the M historical search texts to obtain N sample sets; n is an integer greater than or equal to 1;
s103: and determining target templates respectively related to the N sample sets and similar words contained in word slots of the target templates based on a plurality of historical search texts respectively contained in the N sample sets.
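As a rough, non-authoritative sketch of steps S101 and S102, the grouping logic might look as follows in Python; the function name and the "category"/"label" fields are illustrative assumptions rather than the patent's actual implementation:

```python
from collections import defaultdict

def cluster_search_texts(search_logs):
    """Group historical search texts by the related info of their click resources.

    search_logs: list of (historical_search_text, click_resource_info) pairs,
    where click_resource_info is a dict such as {"category": ..., "label": ...}
    or None when the text has no clicked resource.
    """
    # S101: keep only the historical search texts that have a click resource
    with_clicks = [(t, info) for t, info in search_logs if info is not None]
    # S102: cluster by the click resource's related info (here: category + label)
    sample_sets = defaultdict(list)
    for text, info in with_clicks:
        key = (info.get("category"), info.get("label"))
        sample_sets[key].append(text)
    return dict(sample_sets)
```

Step S103 would then mine a target template and its word-slot similar words from each resulting sample set.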
This embodiment can be applied to an electronic device with data processing capabilities, such as a terminal device or a server.
In this embodiment, the obtaining manner of the M historical search texts may include: acquiring all historical search texts within a preset duration, and extracting M historical search texts with corresponding click resources from all the historical search texts.
The click resource corresponding to any one of the M historical search texts is determined as follows: after the user inputs the historical search text, one or more candidate data resources are recalled based on it; the user then selects and clicks one of these candidate data resources, and the clicked data resource is taken as the click resource corresponding to that historical search text.
Specifically, a data resource may be retrievable data indexed by a search engine. A data resource can satisfy a user's needs for information retrieval and website addressing; for example, the data resources may include document data and website resources. The main text content of document data may include a title and a body; the main text content of a website resource may include a title, a summary, a Uniform Resource Locator (URL), and so on.
The information related to the clicked resource may include at least one of the following: the label of the click resource, the category of the click resource and the source of the click resource.
A click resource may have one or more labels. It usually has one category, though there may be several depending on the actual situation. The source of a click resource may indicate, for example, whether it comes from an official website or an unofficial one.
The relevant information of the click resources is configured in advance. For example, the category of each click resource can be predicted in advance by a binary classification model, and the labels of the click resources can be determined by a keyword extraction method.
Each of the N sample sets contains a plurality of historical search texts.
That is to say, the M historical search texts are clustered based on the relevant information of their corresponding click resources, and historical search texts whose click resources share at least one of the same category, the same label, or the same source are added to the same sample set.
The determining, based on a plurality of historical search texts included in the N sample sets, target templates to which the N sample sets are respectively related and target similar words included in word slots in the target templates may include:
performing word segmentation processing on a plurality of historical search texts contained in the jth sample set of the N sample sets to obtain word segmentation results of the plurality of historical search texts;
determining a candidate template based on the word segmentation results of the plurality of historical search texts, and determining candidate similar words contained in each word slot of one or more word slots in the candidate template;
determining a target template corresponding to the jth sample set based on the confidence degrees corresponding to the candidate templates respectively;
and determining the target similar words of each word slot based on the word confidence of the candidate similar words contained in each word slot in the target template corresponding to the jth sample set.
The jth sample set may be any one of N sample sets.
There may be one or more candidate templates corresponding to the jth sample set, and one or more target templates corresponding to it; the number of candidate similar words and of target similar words in each word slot may likewise be any integer greater than or equal to 1.
Furthermore, according to the scheme provided in this embodiment, the template tree and the word slot tree may be constructed or updated based on the target templates corresponding to all N sample sets and the target similar words contained in the word slots in the target templates.
Based on the template tree and the word slot tree, the obtained current search text can be analyzed, and an intention identification result corresponding to the current search text is obtained.
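To make the intended use concrete, a minimal sketch of matching a current search text against one template with a word slot is shown below; the token format, the "<SLOT>" marker, and the function name are hypothetical simplifications of the template tree and word slot tree lookup described above:

```python
def match_template(template_tokens, slot_words, query_tokens):
    """Return True if query_tokens matches the template.

    template_tokens: e.g. ["how", "to", "apply", "for", "<SLOT>"]
    slot_words: the similar words allowed in the <SLOT> position
    (all names here are illustrative, not the patent's data structures).
    """
    if len(template_tokens) != len(query_tokens):
        return False
    for t_tok, q_tok in zip(template_tokens, query_tokens):
        if t_tok == "<SLOT>":
            if q_tok not in slot_words:   # slot position: any similar word fits
                return False
        elif t_tok != q_tok:              # fixed position: must match exactly
            return False
    return True
```

A real template tree would organize many such templates (and a word slot tree many such slot-word lists) so that the matching template, and hence the intention, can be found efficiently.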
Therefore, by adopting this scheme, sample sets can be generated from historical search texts that have click resources, and the corresponding target template and the similar words contained in its word slots can be determined from the historical search texts in each sample set. The target template and the similar words in its word slots are thus obtained by automatically analyzing historical search samples and their corresponding click resources, which improves the accuracy and efficiency of template generation and allows the template tree and word slot tree used in subsequent prediction to be constructed or updated more accurately and efficiently.
The clustering the M historical search texts based on the relevant information of the click resources respectively corresponding to the M historical search texts to obtain N sample sets includes:
clustering the M historical search texts based on the relevant information of the click resources respectively corresponding to the M historical search texts to obtain K candidate sample sets; k is an integer greater than or equal to N;
and selecting the N sample sets from the K candidate sample sets.
Here, the information related to the clicked resource may include a tag and/or a category of the clicked resource. In addition, the information related to the clicked resource may include, in addition to the category and the tag, a source of the clicked resource, for example, whether the clicked resource originates from an official website.
It should be noted that there may be one or more labels of the clicked resource, and there may be one or more categories of the clicked resource. In addition, the labels and/or categories of the clicked resources may be predetermined.
Specifically, the manner of determining the tags of the clicked resources may be determined by means of keyword extraction. For example, the content of the document corresponding to the clicked resource may be subjected to keyword extraction, and the tag of the clicked resource may be determined. Illustratively, one or more tags of the clicked resource may be obtained by extracting keywords based on a document title corresponding to the clicked resource.
For example, the click resource 1 corresponds to the document 1, the title of the document 1 is "ABCDE", two keywords "AB" and "DE" can be determined by keyword extraction, and the labels of the click resource 1 are "AB" and "DE".
The category of a click resource may be predicted by one or more binary classification models. These binary classification models are obtained by pre-training, and this embodiment does not limit how they are generated. The category of the click resource may be one or more of the candidate categories.
Illustratively, the candidate categories may include technology, news, administration, manpower, and the like. It should be understood that there may be more possible candidate categories in the actual processing, but this embodiment is not exhaustive.
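As an illustration of category prediction with one binary classification model per candidate category (the category names and the classifier interface below are assumptions for demonstration only):

```python
def predict_categories(resource_text, classifiers):
    """Predict the categories of a click resource.

    classifiers maps a candidate category name to a binary classifier,
    modeled here as a function returning True when the resource
    belongs to that category.
    """
    return [cat for cat, clf in classifiers.items() if clf(resource_text)]
```

A resource for which several classifiers fire receives several categories, matching the note above that a click resource may have one or more categories.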
The M historical search texts may be a part of all the historical search texts within a preset time period. Preferably, the M historical search texts are historical search texts with corresponding click resources. Wherein M may be an integer of 1 or more.
That is, at the end of each preset time period, the historical search texts within that period that have corresponding click resources can be collated. Further, the M historical search texts are clustered according to at least one of the category, the label, and the source of the click resource corresponding to each historical search text, to obtain the N sample sets.
The following table is given as an example:
[Table omitted: the table lists historical search texts 1 to 8 together with the category, label, and source of their corresponding click resources.]
the historical search texts 1, 2, 3 and 4 with the clicked resources in the same category and the same label are integrated into a sample set alpha.
The historical search texts 5, 6, 7 and 8 with the same source of the click resources (all from the official website) are integrated into a sample set beta.
Therefore, by adopting the processing, the historical search texts can be clustered based on the relevant information of the clicked resources, and a sample set consisting of a plurality of historical search texts is obtained; therefore, the subsequent processing of segmenting the historical search text in each sample set to finally obtain the template can be more efficient and accurate.
In the foregoing case of obtaining the candidate sample set, the candidate sample set may be further filtered, specifically, the selecting the N sample sets from the K candidate sample sets includes at least one of:
counting the number of historical search texts contained in an ith candidate sample set in the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the historical search texts reaches a first preset number; i is an integer of 1 or more and K or less;
counting the number of the historical search texts of the target type contained in the ith candidate sample set of the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets when the number of the historical search texts of the target type reaches a second preset number;
and obtaining user identifications associated with the historical search text contained in the ith candidate sample set of the K candidate sample sets, removing the duplication of the user identifications associated with the historical search text to obtain the number of the user identifications, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the user identifications reaches a third preset number.
Respectively, the first way: counting the number of historical search texts contained in an ith candidate sample set in the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the historical search texts reaches a first preset number; i is an integer of 1 or more and K or less.
The first preset number may be set according to actual conditions, and may be 500, for example.
That is, the number of the historical search texts contained in each candidate sample set is counted, and if the number of the historical search texts contained in the candidate sample set is greater than a first preset number, the candidate sample set is used as a finally used sample set; otherwise, the set of candidate samples may be deleted.
In a second manner, the number of the historical search texts of the target type contained in the ith candidate sample set of the K candidate sample sets is counted, and the ith candidate sample set is taken as one of the N sample sets when the number of the historical search texts of the target type reaches a second preset number.
The history search text of the target type may be a unique history search text, that is, the history search text is different from other history search texts.
The second preset number may be set according to actual conditions, and may be 100, for example.
That is, the number of the historical search texts of the target type contained in each candidate sample set is counted respectively, and if the number of the historical search texts contained in the candidate sample set is greater than a second preset number, the candidate sample set is used as a finally used sample set; otherwise, the set of candidate samples may be deleted.
The third mode is as follows: and obtaining user identifications associated with the historical search text contained in the ith candidate sample set of the K candidate sample sets, removing the duplication of the user identifications associated with the historical search text to obtain the number of the user identifications, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the user identifications reaches a third preset number.
The third preset number may be set according to actual conditions, and may be 100, for example.
For example, suppose the ith candidate sample set contains 600 historical search texts: 200 are associated with user ID 1, another 200 with user ID 2, another 100 with user ID 3, and the remaining 100 with user ID 4. The number of distinct user IDs associated with the historical search texts in the ith candidate sample set is then 4; if this is smaller than the third preset number, the ith candidate sample set is deleted. This prevents a sample set from being dominated by the frequent operations of a few users, and thus ensures the generalization of the finally obtained sample sets.
All three modes can be used, only one mode or two modes can be used, and the combination modes are not exhaustive here.
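The three filtering criteria can be sketched as a single check; combining them with a logical AND is just one of the possible combinations mentioned above, and the threshold defaults merely echo the example values in the text (500, 100, 100):

```python
def keep_sample_set(texts, user_ids,
                    min_texts=500, min_unique_texts=100, min_users=100):
    """Apply the three filtering criteria to one candidate sample set.

    texts: the historical search texts in the set; user_ids: the user ID
    associated with each text. The AND combination and the thresholds are
    assumptions, not fixed by the patent.
    """
    enough_texts = len(texts) >= min_texts                # first way
    enough_unique = len(set(texts)) >= min_unique_texts   # second way
    enough_users = len(set(user_ids)) >= min_users        # third way (dedup)
    return enough_texts and enough_unique and enough_users
```

A candidate sample set failing the check would be deleted rather than kept as one of the N sample sets.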
Therefore, by filtering the candidate sample set, the condition that the confidence coefficient of the finally used sample set is low can be avoided, and the generalization of the sample set is ensured, so that more accurate information is provided for determining the template based on the historical search text contained in the sample set.
The determining, based on a plurality of historical search texts included in the N sample sets, a target template related to the N sample sets and similar words included in a word slot of the target template includes:
determining candidate templates related to the jth sample set and candidate similar words contained in word slots of the candidate templates based on L historical search texts contained in the jth sample set in the N sample sets; j is an integer of 1 or more and N or less; l is an integer greater than or equal to 1;
selecting a candidate template with a template confidence degree larger than a template confidence degree threshold value from the candidate templates as a target template related to the jth sample set; and selecting candidate similar words with word confidence degrees larger than a word confidence degree threshold value from the candidate similar words contained in the word slot of the target template related to the jth sample set as the similar words contained in the word slot of the target template.
The jth sample set is any one of the N sample sets, and processing of the N sample sets is the same, so that description is omitted.
The candidate templates related to the jth sample set and the candidate similar words contained in their word slots specifically include: a plurality of candidate templates associated with the jth sample set, and the candidate similar words contained in the word slots of each of the plurality of candidate templates.
Further, each candidate template may include one or more word slots, and accordingly, each of the one or more word slots may contain one or more candidate similar words.
The template confidence of the candidate template can be determined according to the use frequency of the candidate template in the processing process; the word confidence of the candidate similar words may be determined according to the occurrence frequency of the candidate similar words in the processing process.
For example, there are 100 historical search texts in the jth sample set, and the candidate template 1 is used 50 times in the processing process, that is, 50 historical search texts can be matched, so that the template confidence of the candidate template is 50/100; if a word 1 in the word slot 1 of the candidate template 1 occurs 30 times during the process of matching the candidate template 1 with 50 historical search texts, the confidence level of the word 1 is 30/50.
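The worked example above can be written out directly; the two helper functions are illustrative, not part of the patent:

```python
def template_confidence(num_matched, num_texts):
    """Fraction of the sample set's texts matched by the candidate template."""
    return num_matched / num_texts

def word_confidence(num_occurrences, num_matched):
    """Fraction of the template's matches in which the slot word occurred."""
    return num_occurrences / num_matched

# 100 texts in the sample set, candidate template 1 matched 50 of them,
# and word 1 occurred 30 times among those 50 matches:
assert template_confidence(50, 100) == 0.5   # 50/100
assert word_confidence(30, 50) == 0.6        # 30/50
```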
Selecting a candidate template with a template confidence degree greater than a template confidence degree threshold value from the candidate templates as a target template related to the jth sample set, which may specifically be: and judging whether the template confidence of each candidate template exceeds a template confidence threshold, if so, taking the template as a target template related to the jth sample set, and otherwise, deleting the candidate template.
Here, the template confidence threshold may be set according to actual situations, for example, may be set to 0.35, or higher or lower, and this embodiment does not limit this.
Selecting candidate similar words with word confidence greater than a word confidence threshold from the candidate similar words contained in the word slot of the target template related to the jth sample set as the similar words contained in the word slot of the target template, which may specifically be:
and judging whether the word confidence of each candidate similar word contained in each word slot is greater than the word confidence threshold; if so, the candidate similar word is taken as a similar word contained in the corresponding word slot of the target template; otherwise, the candidate similar word is deleted.
The word confidence threshold may be set according to the actual situation; for example, it may be set to 0.15, or larger or smaller, which is not limited in this embodiment.
Therefore, by adopting the scheme, after the candidate template and the candidate similar words contained in the candidate template are determined based on the historical search text of each sample set, the filtering is further carried out based on the template confidence coefficient of the candidate template and the word confidence coefficient of the candidate similar words, and the similar words of the word slot in the target template and the target template are obtained. Therefore, the obtained target template and the similar words in the word slot can be more accurate, and the template tree constructed or updated based on the target template and the similar words in the word slot is further ensured to be more accurate.
Before the word segmentation results corresponding to the L historical search texts contained in the jth sample set are used, the method may further include:
and performing word segmentation processing on the L historical search texts contained in the jth sample set respectively to obtain word segmentation results corresponding to the L historical search texts respectively.
In the word segmentation results respectively corresponding to the L historical search texts, the word segmentation result corresponding to each historical search text may include: one or more words obtained by dividing each historical search text. For example, a history search text divided for "ABCD" may result in 3 words "AB", "C" and "D".
Word segmentation is a common processing step in Chinese NLP (Natural Language Processing), and a poor segmentation result often degrades the final effect of the model. Therefore, this embodiment uses the jieba tokenizer together with its entry-weighting function to ensure the accuracy of the segmentation as far as possible.
For example, before and after the entry [ABC 1000] is configured in the segmentation dictionary, the segmentation changes as follows:
Before [ABC 1000] is configured in the segmentation dictionary: the query "ABC" is segmented by jieba as: AB / C;
After [ABC 1000] is configured in the segmentation dictionary: the query "ABC" is segmented by jieba as: ABC.
Note that in [ABC 1000], "ABC" is the entry (word) and 1000 is its weight. When segmenting, the jieba tokenizer traverses all possible segmentations of the input string, scores each segmentation with a formula based on term length and weight, and then keeps and outputs the segmentation with the highest score.
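The scoring behavior described here can be approximated with a small dynamic program over a weighted dictionary. This is a simplified stand-in for jieba's actual formula (which is based on log word frequencies), but it reproduces the AB / C versus ABC example above:

```python
def best_segmentation(query, dictionary):
    """Pick the highest-scoring segmentation of `query`.

    dictionary maps entries to weights, e.g. {"AB": 10, "C": 1}.
    The score used here (sum of entry weights, with unknown single
    characters scoring 0) is an illustrative simplification.
    """
    n = len(query)
    # best[i] = (score, tokens) for the best segmentation of query[:i]
    best = [(0.0, [])] + [None] * n
    for i in range(1, n + 1):
        for j in range(i):
            piece = query[j:i]
            # allow dictionary entries, plus single characters as a fallback
            if piece in dictionary or len(piece) == 1:
                gain = dictionary.get(piece, 0.0)
                score = best[j][0] + gain
                if best[i] is None or score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[n][1]
```

With the dictionary {"AB": 10, "C": 1}, best_segmentation("ABC", ...) yields ["AB", "C"]; after adding the entry "ABC" with weight 1000, it yields ["ABC"], mirroring the before/after example above.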
In the above segmentation process, configuring the segmentation dictionary with a large number of entries and their weights is extremely important. Obtaining the entries in the segmentation dictionary may include: performing feature calculation on each piece of original data to obtain one or more candidate words contained in it, and updating the entries and their weights in the segmentation tool based on the one or more candidate words. Specifically, as shown in fig. 2, this may include:
S201: preprocessing the original data to obtain preprocessed original data;
Here, preprocessing the original data may include deleting parts of preset types contained in it; the preset types may include punctuation marks, stop words, spaces and the like, that is, such content may be deleted from the original data.
S202: performing feature extraction on the preprocessed original data to obtain feature information corresponding to one or more candidate words contained in the preprocessed original data;
The feature information may include word frequency, degree of freedom (how freely the characters adjacent to the candidate vary), degree of solidification (the internal cohesion of the candidate), and the like.
S203: based on the feature information respectively corresponding to the one or more candidate words, filtering the one or more candidate words to obtain one or more words to be added, and adding them to the segmentation dictionary.
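Steps S201–S203 can be sketched together in a hedged way. The function name, thresholds, and the use of minimum-split pointwise mutual information as the "degree of solidification" are assumptions for illustration; the degree-of-freedom feature is omitted for brevity:

```python
import math
from collections import Counter

def mine_candidates(texts, max_len=4, min_freq=2, min_pmi=0.5):
    """Sketch of S201-S203: strip punctuation/spaces (preprocessing),
    count every n-gram (word frequency), score internal cohesion (the
    "degree of solidification") as the minimum pointwise mutual
    information over the candidate's binary splits, and keep candidates
    passing both thresholds."""
    stop = set(" ,.!?;:")
    cleaned = ["".join(ch for ch in t if ch not in stop) for t in texts]
    grams = Counter()
    for t in cleaned:
        for size in range(1, max_len + 1):
            for i in range(len(t) - size + 1):
                grams[t[i:i + size]] += 1
    total = sum(c for g, c in grams.items() if len(g) == 1) or 1
    p = lambda g: grams[g] / total
    kept = []
    for g, c in grams.items():
        if len(g) < 2 or c < min_freq:
            continue
        # cohesion: weakest binary split of the candidate
        pmi = min(math.log(p(g) / (p(g[:k]) * p(g[k:])))
                  for k in range(1, len(g)))
        if pmi >= min_pmi:
            kept.append(g)
    return sorted(kept)

print(mine_candidates(["ab ab", "ab cd", "ab"]))   # ['ab']
```

The surviving candidates would then go through the model and dictionary filtering described next before entering the segmentation dictionary.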
The filtering may be carried out through a model and a dictionary. Specifically, the one or more candidate words are first filtered through the model to obtain one or more preselected words, and the preselected words obtained after model filtering are then filtered through a dictionary to obtain the one or more words to be added.
Filtering the candidate words through the model may mean inputting each candidate word into the model and obtaining the recognition result it outputs, which may be true or false: a result of true indicates the candidate word is kept as a preselected word, while a result of false indicates it is deleted.
Dictionary filtering of the preselected words to obtain the words to be added may specifically be: matching each preselected word against the existing word stock and common words; if the preselected word is contained in the existing word stock or the common words, it is deleted, otherwise it is taken as a word to be added.
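The dictionary-filtering step can be sketched as follows (function name and inputs are illustrative):

```python
def dictionary_filter(preselected, existing_lexicon, common_words):
    """Dictionary filtering: drop preselected words already present in
    the existing word stock or the common-word list; the remainder are
    the words to be added to the segmentation dictionary."""
    known = set(existing_lexicon) | set(common_words)
    return [w for w in preselected if w not in known]

print(dictionary_filter(["new term", "Beijing"], {"Beijing"}, {"the", "of"}))
# ['new term']
```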
By this method, a batch of proper nouns and new words (namely the words to be added) can be obtained and put into the segmentation dictionary of the jieba tokenizer.
The determining a candidate template related to a jth sample set and candidate similar words contained in a word slot of the candidate template based on L historical search texts contained in the jth sample set of the N sample sets comprises:
determining a kth group of co-occurrence words based on word segmentation results respectively corresponding to the L historical search texts contained in the jth sample set; k is an integer of 1 or more;
taking the P historical search texts containing the kth group of co-occurrence words in the jth sample set as a kth sub-sample set; p is an integer of 1 or more and L or less;
determining first-class words other than the kth group of co-occurring words according to the word segmentation results respectively corresponding to the P historical search texts of the kth sub-sample set, and determining the kth group of candidate templates and the initial words in the word slots of each candidate template in the kth group based on the kth group of co-occurring words and the first-class words;
determining the candidate similar words respectively contained in the word slots of the candidate templates in the kth group of candidate templates based on the word segmentation results respectively corresponding to the L historical search texts in the jth sample set and the initial words in the word slots of the candidate templates in the kth group of candidate templates.
One or more co-occurring words may be included in the kth group of co-occurring words.
Determining the kth group of co-occurring words based on the word segmentation results respectively corresponding to the L historical search texts included in the jth sample set may be:
searching those word segmentation results for words whose frequency of occurrence exceeds a frequency threshold as co-occurring words, and taking one or more of the co-occurring words as the kth group of co-occurring words.
The frequency threshold may be set according to the actual situation; for example, it may be 0.2. If a word appears 10 times in 100 historical search texts, its frequency of occurrence is 10/100, less than 0.2, so it is not a co-occurring word; if a word appears 34 times, its frequency of occurrence is 34/100, greater than 0.2, so it can be regarded as a co-occurring word.
In addition, it should be noted that taking one or more of the co-occurring words as the kth group of co-occurring words may mean that, if a plurality of co-occurring words are currently found, each co-occurring word may be taken alone as a group of co-occurring words, or two, three or more of them may together be taken as a group. The kth group of co-occurring words may be any one of these groups.
For example, the jth sample set includes 100 historical search texts in which word a appears 30 times, word b appears 20 times, and the remaining words appear fewer times; then word a may be taken as the kth group of co-occurring words.
As another example, the jth sample set includes 100 historical search texts in whose word segmentation results word a appears 30 times, word b also appears 30 times, and the remaining words appear fewer times; then word a and word b can together be taken as the kth group of co-occurring words.
Taking the P historical search texts containing the kth group of co-occurring words in the jth sample set as the kth sub-sample set may specifically include: traversing the L historical search texts contained in the jth sample set, bucketing them according to the kth group of co-occurring words to obtain the single-bucket data of the P historical search texts containing that group, and taking the single-bucket data as the kth sub-sample set.
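The co-occurring-word search and the data bucketing can be sketched together. The function names, the reading of "frequency" as document frequency over the sample set, and the toy segmentation data are assumptions:

```python
from collections import Counter

def co_occurring_words(segmented, freq_threshold=0.2):
    """Words whose document frequency over the sample set's segmentation
    results exceeds the threshold (0.2 in the example above)."""
    df = Counter()
    for words in segmented:
        for w in set(words):            # count each word once per text
            df[w] += 1
    return {w for w, c in df.items() if c / len(segmented) > freq_threshold}

def bucket(segmented, group):
    """Data bucketing: the k-th sub-sample set is the single-bucket data
    of texts whose segmentation contains every word of the k-th group."""
    return [words for words in segmented if set(group) <= set(words)]

segs = [["room A", "where"], ["room A", "location"],
        ["room A", "how to go"], ["room B", "where"], ["room A"]]
print(co_occurring_words(segs, 0.5))    # {'room A'}
print(len(bucket(segs, {"room A"})))    # 4
```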
Determining the first-class words other than the kth group of co-occurring words based on the word segmentation results respectively corresponding to the P historical search texts of the kth sub-sample set may include: obtaining the words other than the kth group of co-occurring words from those word segmentation results, and selecting the first-class words from them based on their similarity.
The similarity may be determined based on edit distance and/or word-sense similarity. The edit distance refers to the minimum number of edit operations required to transform one word into the other. The word-sense similarity may be obtained from a word-sense recognition model: for example, two words are input to the model, and a recognition result of true indicates that they are similar, while false indicates that they are not.
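For the edit distance, a standard Levenshtein implementation, reading "edit operations" as single-character insertions, deletions and substitutions:

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning word a into word b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("kitten", "sitting"))   # 3
print(edit_distance("word", "word"))        # 0  (identical words)
```

A distance of 0 is exactly the "same word" criterion used below when counting identical words.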
Illustratively, selecting the first-class words from the other words based on their similarity may include: counting the occurrences of identical words among the other words and taking the identical word with the most occurrences as the first-class word, where "identical" refers to the aforementioned edit distance of 0. For example, P is 30, i.e. the kth sub-sample set includes 30 historical search texts, and besides the kth group of co-occurring words these texts contain word A, word B, word C and so on; if word A appears 20 times, word B once and word C 15 times, then word A is the identical word with the most occurrences and becomes the first-class word.
Determining the kth group of candidate templates and the initial terms in the term slots of each candidate template in the kth group of candidate templates based on the kth group of co-occurring terms and the first class of terms, which may specifically include:
determining one or more word slots in each of the kth set of candidate templates based on the kth set of co-occurring words and the first class of words; and determining an initial word in each word slot in each of the candidate templates.
For example, one candidate template is W-D1-W-D2-W, where the initial words in D1 may be co-occurring words and the initial words in D2 may be first-class words.
Determining the candidate similar words respectively contained in the word slots of the candidate templates in the kth group of candidate templates based on the word segmentation results respectively corresponding to the L historical search texts in the jth sample set and the initial words in the word slots of the candidate templates in the kth group of candidate templates, specifically:
traversing the L historical search texts in the jth sample set, searching candidate words with the same category as the initial words in each word slot of each candidate template from the L historical search texts, adding the candidate words into the corresponding word slot, and finally obtaining one or more candidate similar words contained in each word slot of each candidate template.
In the solution provided in this embodiment, the template identification may follow the idea of the Snowball algorithm. An exemplary illustration may comprise:
performing word segmentation on the L historical search texts in a jth sample set of the multiple sample sets to obtain word segmentation results corresponding to the L historical search texts contained in the jth sample set. Here, each participle in the participle results respectively corresponding to the L historical search texts included in the jth sample set may be taken as an entity.
Traversing the jth sample set and searching for the historical search texts containing the kth group of co-occurring words (which may also be called co-occurring entities) for data bucketing, i.e., obtaining the kth sub-sample set. For example, the historical search texts containing co-occurring word A are put into one bucket, those containing co-occurring word B into another, and those containing co-occurring word A + co-occurring word B into a third; it should be understood that some of the same historical search texts may appear in the buckets corresponding to different groups of co-occurring words.
It should be further noted that, when finding co-occurring words, 1 co-occurring word, 2 co-occurring words and 3 co-occurring words may be processed in turn, and the candidate templates obtained with these different numbers of co-occurring words may respectively correspond to W-D-W, W-D-W-D-W and W-D-W-D-W-D-W. Of course, one group of co-occurring words may yield one or more candidate templates, which is not limited in this embodiment.
Based on the word segmentation results corresponding to the P historical search texts in the kth sub-sample set (i.e., the single-bucket data corresponding to the kth group of co-occurring words), the degree of similarity of the text other than the kth group of co-occurring words (or single-side entities) in those segmentation results is analyzed, and a common paradigm is found as the kth group of candidate templates.
Traversing the jth sample set by using the kth group of candidate templates, and searching words (or called homogeneous entities) in the same category as the kth group of co-occurring words from the L historical search texts contained in the jth sample set; adding the words in the same category as the k-th group of co-occurring words into the corresponding word slots of the k-th group of candidate templates.
The process is then repeated: the jth sample set is traversed, and the historical search texts containing the (k+1)th group of co-occurring words (co-occurring entities) are searched for data bucketing, i.e., the (k+1)th sub-sample set is obtained. The subsequent processing is the same as for the kth group of co-occurring words and is not described in detail.
During this processing, the number of times each candidate template matches a historical search text is recorded, and the template confidence of each candidate template is determined based on that frequency; likewise, the number of times each candidate similar word appears in each candidate template when a historical search text matches it is recorded, and the word confidence of each candidate similar word is determined based on that frequency. A candidate set is then given according to the confidence scores: the target template related to the jth sample set is determined based on the template confidences of the candidate templates, and the similar words contained in the word slots of the target template are determined based on the word confidences of the candidate similar words in those slots.
For example, suppose the jth sample set is a batch of 7 historical search texts whose clicked resource is a conference-room addressing site, as follows: how to go to conference room A; where conference room A is; where conference room B is; conference room B location; conference room A location; conference room C; conference room A.
Performing word segmentation on the 7 historical search texts, wherein the word segmentation result is as follows:
how to go in conference room A- - > how/go in conference room A
Where conference room A is- - > conference room A/where
Where conference room B is — > conference room B/where
Conference room B location > conference room B/location
Conference room A location > conference room A/location
Conference room C- - > conference room C
Conference room A- - > conference room A
First, the word with the highest frequency of occurrence, "conference room A", can be found as the 1st group of co-occurring words;
then the jth sample set is traversed and the historical search texts containing "conference room A" are clustered, giving a 1st sub-sample set of 4 historical search texts.
From the 1st sub-sample set, the text other than "conference room A" in those 4 historical search texts is found: "where", "location" and "how to go"; since the three words occur with the same frequency, each of them can be taken as a first-class word for subsequent processing.
Two candidate templates may be obtained: one is W-D-W, with D = [conference room A]; the other is W-D1-W-D2-W, with D1 = [conference room A] and D2 = [location, where, how to go].
Traversing the 7 historical search texts included in the jth sample set based on these candidate templates, the words of the same category for D1 may include "conference room B".
Finally, for the jth sample set, the result can be:
Target template 1: W-D-W, D = [conference room A 4/6, conference room B 2/6], with a template confidence of 6/7; here "4/6" in "conference room A 4/6" is the word confidence of the candidate similar word "conference room A".
Target template 2: W-D1-W-D2-W, D1 = [conference room A, conference room B], D2 = [location, where, how to go], with a template confidence of 5/7.
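The confidence arithmetic of this worked example can be reproduced with a small sketch. Treating "the template matches a text" as "the text contains some slot word" is a simplification of real template matching, and the English renderings of the sample texts are illustrative:

```python
from collections import Counter
from fractions import Fraction

def score_template(sample_set, slot_words):
    """Confidences from the worked example: template confidence is the
    fraction of sample texts the candidate template matches (here
    approximated as containing any slot word); word confidence is each
    slot word's share of those matches."""
    matches = [t for t in sample_set if any(w in t for w in slot_words)]
    t_conf = Fraction(len(matches), len(sample_set))
    hits = Counter(w for t in matches for w in slot_words if w in t)
    w_conf = {w: Fraction(c, len(matches)) for w, c in hits.items()}
    return t_conf, w_conf

sample_set = [
    "how to go to conference room A", "where is conference room A",
    "where is conference room B", "conference room B location",
    "conference room A location", "conference room C", "conference room A",
]
t_conf, w_conf = score_template(
    sample_set, ["conference room A", "conference room B"])
print(t_conf)                        # 6/7
print(w_conf["conference room A"])   # 2/3  (i.e. 4/6)
```

The 6/7 template confidence and the 4/6 and 2/6 word confidences match target template 1 above.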
By adopting this scheme, a sample set can be generated from historical search texts with clicked resources, and the corresponding target template and the similar words in its word slots can be determined from the historical search texts in the sample set. Since the target template and the similar words in its word slots are obtained by automatically analyzing historical search samples and their clicked resources, the influence of human factors on template generation is reduced, and the accuracy and generation efficiency of the final templates are ensured; subsequent uses of the templates, such as updating or constructing the template tree and the word slot tree and predicting with them, also become more accurate and efficient.
After the above processing is completed, the method further comprises: updating a template tree based on the target templates respectively related to the N sample sets; and updating a word slot tree based on the similar words contained in the word slots of those target templates.
That is, all target templates corresponding to each sample set in all sample sets are used for constructing or updating the template tree; and all similar words contained in all word slots in all target templates are used for constructing or updating the word slot tree.
The target template is a text expression template that can express the intention of the corresponding search text. For example, a template A of the form "[city] to [city]'s [transport]" is a common template whose intention is ticket purchase.
The target template comprises one or more word slots; each word slot contains one or more words of the same category, and the words of the same category in a word slot may also be called the dictionary corresponding to that word slot. It should be understood that each word slot contained in the target template is the word slot of a category of words, and the category it relates to may be set according to the words it contains.
For example, in template A "[city] to [city]'s [transport]", [city] and [transport] are word slots; the dictionary of the word slot [city] may include Beijing, Shanghai, Chengde, etc., which are not exhaustive here.
Based on all the target templates obtained in this embodiment, a template tree may be constructed or updated.
The template tree is obtained by assembling the templates in the form of a dictionary (trie) tree, and the word slot tree by assembling the word slots in the same form.
When pattern matching is implemented on a trie, each node represents a state, and a leaf node represents a successful template match. The edges between nodes represent transition conditions: if a transition condition matches the current prefix of the string, the current state node moves along the corresponding edge to the next state node.
In the word slot tree likewise, each node represents a state: the root node is the initial state, a leaf node represents a successful match, and intermediate nodes are intermediate states of the matching process. An edge from a node to its child node is called a state transition condition; here, a state transition condition is a single Chinese character or character of a word.
Assume that the words contained in the current dictionary are as follows:
[D:city] (the word slot of city words): Beijing, Beijing West, Chengde
[D:trans] (the word slot of vehicle words): train
[D:to]: to
[D:of]: of
The word slot tree constructed based on the words contained in the above dictionary is as shown in fig. 3:
Beijing: 1-2-7
Beijing West: 1-2-7-10
Chengde: 1-3-8
to: 1-4
of: 1-5
train: 1-6-9.
Moreover, as can be seen from fig. 3, each ending leaf node additionally stores the word slot corresponding to the word, specifically the word slot of that category of words.
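The word slot tree of fig. 3 can be sketched as a nested-dict trie; the "$slot" terminal marker and the slot labels are illustrative conventions:

```python
def build_slot_tree(dictionary):
    """Assemble word slots into a trie: each node is a dict keyed by one
    character; a terminal node stores the word's slot under '$slot'."""
    root = {}
    for slot, words in dictionary.items():
        for word in words:
            node = root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$slot"] = slot        # terminal carries the slot label
    return root

tree = build_slot_tree({
    "D:city": ["Beijing", "Beijing West", "Chengde"],
    "D:trans": ["train"],
    "D:to": ["to"],
    "D:of": ["of"],
})

# "Beijing" ends at a node that carries its slot, and "Beijing West"
# continues through that node, mirroring paths 1-2-7 and 1-2-7-10.
node = tree
for ch in "Beijing":
    node = node[ch]
print(node["$slot"])       # D:city
```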
The construction principle of the template tree is consistent with that of the word slot tree. Assume a target template is: [D:city][D:to][D:city][D:of][D:trans]; the path in the trie constructed from this target template is 1-2-5-8-10-11, as shown in fig. 4. When updating or constructing the template tree from the target template, the word slot of each word category in turn serves as the transition condition to the next state node, up to the final end node. It should be noted that the intention corresponding to the target template is also set at the end node, and may be determined based on the word slots respectively corresponding to the word categories contained in the template. For example, the final intention of the target template at node 11 in fig. 4 is the ticket-buying intention interaction [@BUY_TICKETS].
By adopting this scheme, a sample set can be generated from historical search texts with clicked resources, the corresponding target template and the similar words in its word slots can be determined from the historical search texts in the sample set, and the template tree and the word slot tree can be updated based on them. Since the target template and the similar words in its word slots are obtained by automatically analyzing historical search samples and their clicked resources, the influence of human factors on template generation is reduced and the accuracy and generation efficiency of the final templates are ensured; updating the template tree and the word slot tree based on the target template and the similar words quickly extends their applicable range, so that recognition with the template tree and the word slot tree is more accurate and efficient.
Further, in the solution provided in this embodiment, the method further includes: and under the condition that the current search text is received, determining an intention recognition result corresponding to the current search text based on the word slot tree and the template tree.
The matching process for determining the intention recognition result corresponding to the current search text based on the word slot tree and the template tree is divided into two levels: the first level is matching in the template tree, and the second level is matching in the word slot tree.
Any node of the template tree (such as the root node or a leaf node) can trigger entry into the root node of the word slot tree to obtain a matching result returned by the word slot tree; if the matching result returned by the word slot tree meets the transition condition of the current template tree node, the next node of the template tree is entered. As shown in fig. 5, node 1 of the template tree, i.e. the root node, triggers entry into the root node of the word slot tree until a first matching result is returned; if the target transition condition of template tree node 1 is met, node 2 of the template tree is entered; the root node of the word slot tree is then entered again until a second matching result is returned; if this matching result meets the target transition condition of template tree node 2, node 3 of the template tree can be entered; and so on, which is not exhaustive here.
Specifically, when the current node is at the y-th node of the template tree (y is an integer greater than or equal to 1), the processing procedure may include:
entering a root node of the word slot tree, acquiring a current character to be matched from the current residual characters of the current search text, and matching based on at least one transfer condition of the current character to be matched and the root node of the word slot tree;
in the case of matching a target transition condition among the at least one transition condition, entering the next node corresponding to that target transition condition; and, in the case that this next node of the word slot tree carries the word category of a word slot, saving that word category;
and, in the case of not matching any target transition condition among the at least one transition condition, returning to the yth node of the template tree.
The yth node of the template tree may be any node in the template tree, and the processing corresponding to each node is not described here one by one.
Further, when the node is matched with a target transfer condition in at least one transfer condition, after entering a next node corresponding to the target transfer condition, the method may further include:
taking the next node as a current node;
and obtaining the current character to be matched from the current residual characters of the current search text, and matching based on the current character to be matched and at least one transfer condition of the current node in the word slot tree.
And repeating the steps until all characters in the current search text are processed, and finally obtaining a matching result of the target path of the template tree.
It is noted that the target path of the template tree may include at least one transition state and an intention recognition result. In addition, each transition state may correspond to a specific word.
For example, explained in conjunction with figs. 3 and 4, assume the current search text is "Beijing West to Chengde's train". The whole matching process is as follows:
First, enter the root node 1 of the template tree, then the root node 1 of the word slot tree; on transition condition "Bei", enter node 2; on transition condition "jing", enter node 7, matching out "Beijing", i.e. [D:city]; on transition condition "West", enter node 10, matching out "Beijing West", also [D:city]; "to" is not a transition condition, so matching terminates, the word slot tree is exited, and control returns to the template tree.
Match the target transition condition in the template tree: "Beijing" is [D:city], so enter the next node of the template tree; the string still to be matched is "West to Chengde's train". Enter the root node 1 of the word slot tree; "West" is not a transition condition, so terminate, exit the word slot tree, and backtrack to the root node 1 of the template tree.
Match the target transition condition in the template tree again: "Beijing West" is [D:city], so enter the next node of the template tree.
At this point the remaining string to be matched is "to Chengde's train". Re-enter the root node 1 of the word slot tree; match the target transition condition "to", enter node 4, matching out "to", i.e. [D:to]; "Cheng" is not a transition condition, so terminate, exit the word slot tree, and return to the template tree.
Match the target transition condition in the template tree: "to" is [D:to], so enter the next node of the template tree; the remaining string to be matched is now "Chengde's train".
The above process repeats until the state transitions to a leaf node. The template tree finally outputs the matching result: [D:city]: Beijing West, [D:to]: to, [D:city]: Chengde, [D:of]: of, [D:trans]: train.
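A greedy, self-contained sketch of the two-level matching above: it rebuilds a small slot trie and uses longest-match in place of the template tree's full backtracking, so it is an approximation, and the English stand-ins for the Chinese characters are illustrative:

```python
def build_trie(slots):
    """Assemble word slots into a trie: one edge per character; the
    terminal node stores the slot label under the '$slot' key."""
    root = {}
    for slot, words in slots.items():
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$slot"] = slot
    return root

def match_slots(text, tree):
    """From each position, walk the word-slot trie as far as possible,
    keep the longest slot word found, and emit (slot, word) pairs.
    The real template tree additionally backtracks over shorter matches,
    as the "Beijing" vs "Beijing West" step above shows."""
    out, i = [], 0
    while i < len(text):
        node, j, last = tree, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$slot" in node:
                last = (node["$slot"], text[i:j])
        if last:
            out.append(last)
            i += len(last[1])
        else:
            i += 1             # no slot word starts here; skip a character
    return out

tree = build_trie({"D:city": ["Beijing", "Beijing West", "Chengde"],
                   "D:to": ["to"], "D:of": ["of"], "D:trans": ["train"]})
print(match_slots("Beijing West to Chengde of train", tree))
```

On this input the sketch emits [D:city]: Beijing West, [D:to]: to, [D:city]: Chengde, [D:of]: of, [D:trans]: train, the same slot sequence as the worked example.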
It should be added that a wildcard recognition function and a function recognition function may be added to the template tree and the word slot tree to meet the generalization requirements of the tool; for example, a W (wildcard) grammar and an F (function) grammar may be included.
The W (wildcard) grammar takes the form [W:x1-x2], indicating a match of any x1 to x2 characters, where x1 is less than x2 and both are positive integers. For example, [W:2-10] matches any 2 to 10 characters.
By using the W (wildcard character) grammar, when each node of the template tree is walked, the length of the residual character of the current search text is set as length-1, if the node has wildcard characters [ W:2-10], the length of 2-10 is intercepted, and then the nodes are traversed recursively.
The F (function) grammar matches character strings that can be recognized by the corresponding function; for example, [F:num] matches numeric text.
With the F (function) grammar, when walking each node of the template tree, if the node carries a function wildcard [F:func], text of a given range of lengths (or up to the end of the sentence) is intercepted and traversed recursively by the function func. For example, the function in [F:func] may be called to recognize the current search text starting from the current read position; if the recognition result meets the preset function requirement, the two are determined to match.
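Both generalization grammars can be sketched as candidate-span generators; the function names are illustrative, and a real implementation would recurse into the template tree after each candidate:

```python
def wildcard_spans(text, x1, x2):
    """[W:x1-x2] grammar: at a wildcard node, every prefix of the
    remaining text whose length lies in [x1, x2] is intercepted, and
    matching then continues recursively after each candidate."""
    return [text[:n] for n in range(x1, min(x2, len(text)) + 1)]

def function_spans(text, func, max_len=None):
    """[F:func] grammar: intercept prefixes up to max_len (or to the end
    of the sentence) and keep those the function recognizes."""
    limit = len(text) if max_len is None else min(max_len, len(text))
    return [text[:n] for n in range(1, limit + 1) if func(text[:n])]

print(wildcard_spans("abcdef", 2, 4))           # ['ab', 'abc', 'abcd']
print(function_spans("123abc", str.isdigit))    # ['1', '12', '123']
```

Here `str.isdigit` plays the role of the num function that matches numeric text.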
In the scheme provided by this embodiment, preliminary filtering can first be performed using the template confidences of the candidate templates and the word confidences of the candidate similar words to obtain the target template and target similar words. Manual verification can then be carried out, confirming the corresponding intention for each target template from the information about the clicked resources of the sample set. Further, a front-end display effect and interaction style can be designed for the intentions corresponding to different target templates. Finally, an online test is carried out: whether the requirement is met can be determined from the number of clicks, since accurate intention recognition lets the user find the required target resource with fewer clicks; the accuracy of the template tree and word slot tree updated or constructed this time can therefore be evaluated from the number of clicks a user makes during a search.
Thus, by adopting this scheme, the intention of the current search text can be recognized with the template tree and the word slot tree. Updating or constructing the template tree and the word slot tree based on the target template and the target similar words quickly extends their applicable range, so that intention recognition on the current search text is more accurate and efficient.
According to a second aspect of embodiments of the present application, there is also provided a template generating apparatus, as shown in fig. 6, including:
an information obtaining module 301, configured to obtain M historical search texts and click resources corresponding to the M historical search texts, respectively; m is an integer greater than or equal to 1;
a clustering module 302, configured to cluster the M historical search texts based on the relevant information of the click resources respectively corresponding to the M historical search texts, so as to obtain N sample sets; n is an integer greater than or equal to 1;
a generating module 303, configured to determine, based on a plurality of historical search texts included in the N sample sets, target templates respectively related to the N sample sets and similar words included in word slots of the target templates.
The clustering module 302 is configured to cluster the M historical search texts based on the relevant information of the click resource respectively corresponding to the M historical search texts to obtain K candidate sample sets; k is an integer greater than or equal to N; and selecting the N sample sets from the K candidate sample sets.
The clustering module 302 is configured to perform at least one of:
counting the number of historical search texts contained in an ith candidate sample set in the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the historical search texts reaches a first preset number; i is an integer of 1 or more and K or less;
counting the number of the historical search texts of the target type contained in the ith candidate sample set of the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets when the number of the historical search texts of the target type reaches a second preset number;
and obtaining user identifications associated with the historical search text contained in the ith candidate sample set of the K candidate sample sets, removing the duplication of the user identifications associated with the historical search text to obtain the number of the user identifications, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the user identifications reaches a third preset number.
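The three screening conditions above can be sketched as a single filter that keeps a candidate sample set if it passes at least one enabled check. The record fields ("type", "user_id") and the threshold parameter names are illustrative assumptions, not names from the patent:

```python
def select_sample_sets(candidate_sets, first_preset=None, second_preset=None,
                       third_preset=None, target_type=None):
    """Keep candidate sample sets that satisfy at least one condition.

    Each candidate set is assumed to be a list of records shaped like
    {"text": ..., "type": ..., "user_id": ...}.
    """
    selected = []
    for sample_set in candidate_sets:
        # condition 1: total number of historical search texts
        enough_texts = (first_preset is not None
                        and len(sample_set) >= first_preset)
        # condition 2: number of texts of the target type
        enough_target = (second_preset is not None
                         and sum(1 for r in sample_set
                                 if r.get("type") == target_type) >= second_preset)
        # condition 3: number of distinct (deduplicated) user identifications
        enough_users = (third_preset is not None
                        and len({r["user_id"] for r in sample_set}) >= third_preset)
        if enough_texts or enough_target or enough_users:
            selected.append(sample_set)
    return selected
```

A candidate set with many texts but few distinct users (one user searching repeatedly) would pass condition 1 yet fail condition 3, which is why the patent lists the deduplicated user count as a separate criterion.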
The generating module 303 is configured to determine, based on L historical search texts included in a jth sample set of the N sample sets, a candidate template related to the jth sample set and candidate similar words included in a word slot of the candidate template; j is an integer of 1 or more and N or less; l is an integer greater than or equal to 1; selecting a candidate template with a template confidence degree larger than a template confidence degree threshold value from the candidate templates as a target template related to the jth sample set; and selecting candidate similar words with word confidence degrees larger than a word confidence degree threshold value from the candidate similar words contained in the word slot of the target template related to the jth sample set as the similar words contained in the word slot of the target template.
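The two-stage confidence screening described above can be sketched as follows, assuming each candidate template carries a confidence score and a mapping from candidate similar words to word confidences (how the scores are computed is outside this sketch; the field names are illustrative):

```python
def filter_by_confidence(candidates, template_threshold, word_threshold):
    """Keep templates above the template-confidence threshold, then keep
    only the slot words above the word-confidence threshold."""
    targets = {}
    for template, info in candidates.items():
        if info["confidence"] > template_threshold:
            targets[template] = {
                word for word, conf in info["slot_words"].items()
                if conf > word_threshold
            }
    return targets
```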
The generating module 303 is configured to determine a kth group of co-occurring words based on word segmentation results respectively corresponding to the L historical search texts included in the jth sample set; k is an integer of 1 or more; taking the P historical search texts containing the kth group of co-occurrence words in the jth sample set as a kth sub-sample set; p is an integer of 1 or more and L or less; determining first-class words except the kth group of co-occurring words according to word segmentation results respectively corresponding to the P historical search texts of the kth sub-sample set, and determining initial words in word slots of each candidate template in the kth group of candidate templates and the kth group of candidate templates according to the kth group of co-occurring words and the first-class words; determining the candidate similar words respectively contained in the word slots of the candidate templates in the kth group of candidate templates based on the word segmentation results respectively corresponding to the L historical search texts in the jth sample set and the initial words in the word slots of the candidate templates in the kth group of candidate templates.
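A toy version of the co-occurrence step above, with two simplifying assumptions: a "group of co-occurring words" is reduced to a pair of tokens, and any pair shared by at least `min_support` texts defines one candidate template whose remaining (first-class) tokens seed the word slot:

```python
from itertools import combinations

def induce_candidate_templates(segmented_texts, min_support=2):
    """Map each qualifying co-occurring token pair to its initial slot words.

    segmented_texts: list of token lists (word segmentation results).
    A returned entry reads as: template "a ... [SLOT] ... b", with the
    slot initially filled by the collected first-class words.
    """
    all_pairs = {pair
                 for tokens in segmented_texts
                 for pair in combinations(sorted(set(tokens)), 2)}
    templates = {}
    for a, b in all_pairs:
        # sub-sample set: the texts containing the co-occurring pair
        subset = [t for t in segmented_texts if a in t and b in t]
        if len(subset) >= min_support:
            slot_words = set()
            for tokens in subset:
                # first-class words: everything except the co-occurring pair
                slot_words |= {tok for tok in tokens if tok not in (a, b)}
            templates[(a, b)] = slot_words
    return templates
```

The patent's formulation is more general — a co-occurrence group may contain any number of words, and word order within the template is preserved — but the pair-based version keeps the sub-sample-set and slot-seeding logic visible.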
On the basis of fig. 6, as shown in fig. 7, the apparatus provided in this embodiment further includes:
an updating module 304, configured to update a template tree based on the target templates respectively associated with the N sample sets; and updating the word slot tree based on similar words contained in the word slots of the target template respectively related to the N sample sets.
The device further comprises: and the intention identification module 305 is used for determining an intention identification result corresponding to the current search text based on the word slot tree and the template tree under the condition that the current search text is received.
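A minimal sketch of how the intention identification module might match a segmented query against the stored structures. For readability the template tree is flattened into a dict of token patterns with a "[SLOT]" marker, and the word slot tree into a dict of allowed similar words per intention; real trie-shaped trees would share prefixes, but the matching idea is the same (all names here are illustrative):

```python
def recognize_intent(query_tokens, template_patterns, slot_words):
    """Return the first intention whose template matches the query.

    A pattern token matches either literally or, if it is "[SLOT]",
    when the query token appears in that intention's similar words.
    """
    for intent, pattern in template_patterns.items():
        if len(query_tokens) != len(pattern):
            continue
        allowed = slot_words.get(intent, set())
        if all(q == p or (p == "[SLOT]" and q in allowed)
               for q, p in zip(query_tokens, pattern)):
            return intent
    return None
```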
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 8 shows a schematic block diagram of an example electronic device 500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 performs the respective methods and processes described above, such as the template generation method. For example, in some embodiments, the template generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the template generation method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the template generation method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A template generation method, comprising:
acquiring M historical search texts and click resources corresponding to the M historical search texts respectively; m is an integer greater than or equal to 1;
clustering the M historical search texts based on the relevant information of the clicked resources respectively corresponding to the M historical search texts to obtain N sample sets; n is an integer greater than or equal to 1;
and determining target templates respectively related to the N sample sets and similar words contained in word slots of the target templates based on a plurality of historical search texts respectively contained in the N sample sets.
2. The method of claim 1, wherein the clustering the M historical search texts based on the information about the clicked resource corresponding to the M historical search texts, respectively, to obtain N sample sets comprises:
clustering the M historical search texts based on the relevant information of the click resources respectively corresponding to the M historical search texts to obtain K candidate sample sets; k is an integer greater than or equal to N;
and selecting the N sample sets from the K candidate sample sets.
3. The method of claim 2, wherein said selecting said N sample sets from said K candidate sample sets comprises at least one of:
counting the number of historical search texts contained in an ith candidate sample set in the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the historical search texts reaches a first preset number; i is an integer of 1 or more and K or less;
counting the number of the historical search texts of the target type contained in the ith candidate sample set of the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets when the number of the historical search texts of the target type reaches a second preset number;
and obtaining user identifications associated with the historical search text contained in the ith candidate sample set of the K candidate sample sets, removing the duplication of the user identifications associated with the historical search text to obtain the number of the user identifications, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the user identifications reaches a third preset number.
4. The method of claim 1, wherein the determining a target template to which the N sample sets are respectively related and a homogeneous term contained in a term slot of the target template based on a plurality of historical search texts contained in the N sample sets respectively comprises:
determining candidate templates related to the jth sample set and candidate similar words contained in word slots of the candidate templates based on L historical search texts contained in the jth sample set in the N sample sets; j is an integer of 1 or more and N or less; l is an integer greater than or equal to 1;
selecting a candidate template with a template confidence degree larger than a template confidence degree threshold value from the candidate templates as a target template related to the jth sample set; and selecting candidate similar words with word confidence degrees larger than a word confidence degree threshold value from the candidate similar words contained in the word slot of the target template related to the jth sample set as the similar words contained in the word slot of the target template.
5. The method of claim 4, wherein the determining the candidate template related to the jth sample set and the candidate homogeneous terms contained in the term slot of the candidate template based on the L historical search texts contained in the jth sample set of the N sample sets comprises:
determining a kth group of co-occurrence words based on word segmentation results respectively corresponding to the L historical search texts contained in the jth sample set; k is an integer of 1 or more;
taking the P historical search texts containing the kth group of co-occurrence words in the jth sample set as a kth sub-sample set; p is an integer of 1 or more and L or less;
determining first-class words except the kth group of co-occurring words according to word segmentation results respectively corresponding to the P historical search texts of the kth sub-sample set, and determining initial words in word slots of each candidate template in the kth group of candidate templates and the kth group of candidate templates according to the kth group of co-occurring words and the first-class words;
determining the candidate similar words respectively contained in the word slots of the candidate templates in the kth group of candidate templates based on the word segmentation results respectively corresponding to the L historical search texts in the jth sample set and the initial words in the word slots of the candidate templates in the kth group of candidate templates.
6. The method of any of claims 1-5, wherein the method further comprises:
updating a template tree based on the target templates respectively related to the N sample sets; updating a word slot tree based on the homogeneous words contained in the word slots of the target template to which the N sample sets are respectively related.
7. The method of claim 6, wherein the method further comprises:
and under the condition that the current search text is received, determining an intention recognition result corresponding to the current search text based on the word slot tree and the template tree.
8. A template generation apparatus comprising:
the information acquisition module is used for acquiring M historical search texts and click resources corresponding to the M historical search texts respectively; m is an integer greater than or equal to 1;
the clustering module is used for clustering the M historical search texts based on the relevant information of the click resources respectively corresponding to the M historical search texts to obtain N sample sets; n is an integer greater than or equal to 1;
and the generating module is used for determining target templates respectively related to the N sample sets and similar words contained in word slots of the target templates based on a plurality of historical search texts respectively contained in the N sample sets.
9. The apparatus according to claim 8, wherein the clustering module is configured to cluster the M historical search texts based on the relevant information of the click resource corresponding to each of the M historical search texts, so as to obtain K candidate sample sets; k is an integer greater than or equal to N;
and selecting the N sample sets from the K candidate sample sets.
10. The apparatus of claim 9, wherein the clustering module is configured to perform at least one of:
counting the number of historical search texts contained in an ith candidate sample set in the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the historical search texts reaches a first preset number; i is an integer of 1 or more and K or less;
counting the number of the historical search texts of the target type contained in the ith candidate sample set of the K candidate sample sets, and taking the ith candidate sample set as one of the N sample sets when the number of the historical search texts of the target type reaches a second preset number;
and obtaining user identifications associated with the historical search text contained in the ith candidate sample set of the K candidate sample sets, removing the duplication of the user identifications associated with the historical search text to obtain the number of the user identifications, and taking the ith candidate sample set as one of the N sample sets under the condition that the number of the user identifications reaches a third preset number.
11. The apparatus according to claim 8, wherein the generating module is configured to determine, based on L historical search texts included in a jth sample set of the N sample sets, a candidate template related to the jth sample set and candidate homogeneous words included in a word slot of the candidate template; j is an integer of 1 or more and N or less; l is an integer greater than or equal to 1; selecting a candidate template with a template confidence degree larger than a template confidence degree threshold value from the candidate templates as a target template related to the jth sample set; and selecting candidate similar words with word confidence degrees larger than a word confidence degree threshold value from the candidate similar words contained in the word slot of the target template related to the jth sample set as the similar words contained in the word slot of the target template.
12. The apparatus according to claim 11, wherein the generating module is configured to determine a kth group of co-occurring words based on word segmentation results corresponding to the L historical search texts included in the jth sample set respectively; k is an integer of 1 or more; taking the P historical search texts containing the kth group of co-occurrence words in the jth sample set as a kth sub-sample set; p is an integer of 1 or more and L or less; determining first-class words except the kth group of co-occurring words according to word segmentation results respectively corresponding to the P historical search texts of the kth sub-sample set, and determining initial words in word slots of each candidate template in the kth group of candidate templates and the kth group of candidate templates according to the kth group of co-occurring words and the first-class words; determining the candidate similar words respectively contained in the word slots of the candidate templates in the kth group of candidate templates based on the word segmentation results respectively corresponding to the L historical search texts in the jth sample set and the initial words in the word slots of the candidate templates in the kth group of candidate templates.
13. The apparatus of any one of claims 8-12, wherein the apparatus further comprises:
an updating module, configured to update a template tree based on the target templates respectively associated with the N sample sets; updating a word slot tree based on the homogeneous words contained in the word slots of the target template to which the N sample sets are respectively related.
14. The apparatus of claim 13, wherein the apparatus further comprises:
and the intention identification module is used for determining an intention identification result corresponding to the current search text based on the word slot tree and the template tree under the condition that the current search text is received.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202011556696.8A 2020-12-24 2020-12-24 Template generation method and device, electronic equipment and storage medium Active CN112560425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011556696.8A CN112560425B (en) 2020-12-24 2020-12-24 Template generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011556696.8A CN112560425B (en) 2020-12-24 2020-12-24 Template generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112560425A true CN112560425A (en) 2021-03-26
CN112560425B CN112560425B (en) 2024-04-09

Family

ID=75034066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011556696.8A Active CN112560425B (en) 2020-12-24 2020-12-24 Template generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112560425B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444514A (en) * 2022-02-08 2022-05-06 北京百度网讯科技有限公司 Semantic matching model training method, semantic matching method and related device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN109063221A (en) * 2018-11-02 2018-12-21 北京百度网讯科技有限公司 Query intention recognition methods and device based on mixed strategy
CN110059163A (en) * 2019-04-29 2019-07-26 百度在线网络技术(北京)有限公司 Generate method and apparatus, the electronic equipment, computer-readable medium of template
CN110245348A (en) * 2019-05-17 2019-09-17 北京百度网讯科技有限公司 A kind of intension recognizing method and system
CN111444722A (en) * 2020-03-06 2020-07-24 中国平安人寿保险股份有限公司 Intent classification method, device, equipment and storage medium based on voting decision
CN111488426A (en) * 2020-04-17 2020-08-04 支付宝(杭州)信息技术有限公司 Query intention determining method and device and processing equipment
WO2020161505A1 (en) * 2019-02-08 2020-08-13 All Street Research Limited Improved method and system for text based searching
CN111831821A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Training sample generation method and device of text classification model and electronic equipment
CN111950254A (en) * 2020-09-22 2020-11-17 北京百度网讯科技有限公司 Method, device and equipment for extracting word features of search sample and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN109063221A (en) * 2018-11-02 2018-12-21 北京百度网讯科技有限公司 Query intention recognition methods and device based on mixed strategy
WO2020161505A1 (en) * 2019-02-08 2020-08-13 All Street Research Limited Improved method and system for text based searching
CN110059163A (en) * 2019-04-29 2019-07-26 百度在线网络技术(北京)有限公司 Generate method and apparatus, the electronic equipment, computer-readable medium of template
CN110245348A (en) * 2019-05-17 2019-09-17 北京百度网讯科技有限公司 A kind of intension recognizing method and system
CN111444722A (en) * 2020-03-06 2020-07-24 中国平安人寿保险股份有限公司 Intent classification method, device, equipment and storage medium based on voting decision
CN111488426A (en) * 2020-04-17 2020-08-04 支付宝(杭州)信息技术有限公司 Query intention determining method and device and processing equipment
CN111831821A (en) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 Training sample generation method and device of text classification model and electronic equipment
CN111950254A (en) * 2020-09-22 2020-11-17 北京百度网讯科技有限公司 Method, device and equipment for extracting word features of search sample and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZONGYI (JOE) LIU et al.: "A Scalable Automated System to Measure User Experience on Smart Devices", 2019 IEEE International Conference on Consumer Electronics, vol. 2019, 13 January 2019 (2019-01-13) *
WANG Zhongqun: "Collusive Sales Fraud Detection Based on Template Users' Information Search Behavior and Statistical Analysis", New Technology of Library and Information Service, no. 11, 30 November 2015 (2015-11-30), pages 41-50 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114444514A (en) * 2022-02-08 2022-05-06 北京百度网讯科技有限公司 Semantic matching model training method, semantic matching method and related device
CN114444514B (en) * 2022-02-08 2023-01-24 北京百度网讯科技有限公司 Semantic matching model training method, semantic matching method and related device

Also Published As

Publication number Publication date
CN112560425B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
CN111967262A (en) Method and device for determining entity tag
CN112084381A (en) Event extraction method, system, storage medium and equipment
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN111177532A (en) Vertical search method, device, computer system and readable storage medium
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN110990532A (en) Method and device for processing text
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN112925883B (en) Search request processing method and device, electronic equipment and readable storage medium
CN113836925A (en) Training method and device for pre-training language model, electronic equipment and storage medium
CN114495143A (en) Text object identification method and device, electronic equipment and storage medium
CN111178080B (en) Named entity identification method and system based on structured information
CN113836316B (en) Processing method, training method, device, equipment and medium for ternary group data
CN112699237B (en) Label determination method, device and storage medium
CN112948573B (en) Text label extraction method, device, equipment and computer storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN112560425B (en) Template generation method and device, electronic equipment and storage medium
CN112541070A (en) Method and device for excavating slot position updating corpus, electronic equipment and storage medium
CN114647727A (en) Model training method, device and equipment applied to entity information recognition
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
CN112307183B (en) Search data identification method, apparatus, electronic device and computer storage medium
CN114417862A (en) Text matching method, and training method and device of text matching model
CN115248890A (en) User interest portrait generation method and device, electronic equipment and storage medium
CN113807091A (en) Word mining method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant