CN112685440B

CN112685440B - Structural query information expression method for marking search semantic role

Info

Publication number: CN112685440B
Application number: CN202011640600.6A
Authority: CN
Inventors: 王程
Original assignee: Shanghai Xinzhaoyang Information Technology Co ltd
Current assignee: Shanghai xinzhaoyang Information Technology Co.,Ltd.
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2022-03-22
Anticipated expiration: 2040-12-31
Also published as: CN112685440A

Abstract

The invention relates to a structured query information expression method for marking search semantic roles. The method treats a search query input by a user as a word sequence and builds a model over that sequence to analyze user behavior; drawing on computational science, cognitive science and psychology, it infers the real search intention behind that behavior. Expressing the natural language text input by the user as structured query information is a successful practice in structured query information extraction and structure prediction, and can be extended to other fields such as natural language processing and data mining. Because the method is based on semi-supervised learning, it combines machine learning with human experience, reduces the cost of manually marking the large number of samples that supervised learning requires, and gives a reasonable explanation of the result set. It helps the search engine analyze the user's search intention and improves the user's search experience and the conversion rate of commodities.

Description

Structural query information expression method for marking search semantic role

Technical Field

The invention relates to a structured query information expression method, in particular to a structured query information expression method for marking search semantic roles, and belongs to the technical field of structured information retrieval.

Background

Information retrieval analyzes and models the process by which people look for information, and designs computer algorithms that execute queries automatically in order to find the information a user needs. One of its key problems is relevance, that is, whether the search results returned by a search engine match the user's real search need. In fields such as e-commerce, relevance also directly affects the conversion rate of goods.

If a query is simply compared against a text as a full-text string and an exact match is required, as a Unix text search tool or a database system does, the returned results usually cannot meet the user's needs. One obvious reason is that the same concept can be expressed by different words, and the same word can express different concepts in different contexts; in information retrieval this is called the vocabulary mismatch problem. In addition, the query words a user enters carry a certain emphasis and implicitly reflect the user's needs and personal preferences. In an e-commerce vertical search engine the measure of relevance is extremely important, because it directly affects users' search satisfaction and the conversion rate of commodities.

Many retrieval models have been proposed in the prior art. A retrieval model is a formal representation of the process of matching a user's search query against the texts in a database, and it is the basis of the ranking algorithm: the search engine uses the retrieval model to retrieve data stored in the database and returns an ordered list of information. A good retrieval model finds the texts relevant to the query and ranks them by relevance, so that the information that best meets the user's needs comes first. Most retrieval models, however, only compute simple string statistics over the text and ignore the structure of the language. Such models can make the results returned by the search engine deviate considerably from what the user wants, and the deviation is more pronounced in vertical search engines such as those for e-commerce.

Prior-art retrieval models such as the vector space model, the BM25 model and the query likelihood model mostly rely on the bag-of-words representation, a simple representation in which a text is treated as an unordered set of words and the syntax and context of the text as a whole are not reflected. From a linguistic point of view, however, a piece of text follows a specific syntactic and grammatical structure and each word is closely tied to its context, so the expressive power of bag of words is very limited. The current goal is to break the constraints of the bag-of-words model, analyze the internal structure of the text, and build a general retrieval model that can handle both structured and unstructured data. Studying text structure is a key part of web search; structured analysis expresses natural language text in structured form so that the user's search intention can be identified more accurately.

Structured search extracts structured information from the search query input by the user and matches it effectively against the background texts. In vertical search engines such as those for e-commerce, most background texts (commodities) are stored in structured or semi-structured form, so structured search has an inherent advantage there, and structurally analyzing users' search queries in order to understand their query intentions in depth has great application value.

In summary, there still exist many deficiencies in the query information expression in the prior art, and the difficulties in the prior art and the problems solved by the present invention mainly focus on the following aspects:

first, text search tools and database systems in the prior art perform only a simple full-text string comparison between the query and the text; the returned results generally cannot meet the user's needs, and the vocabulary mismatch problem of information retrieval arises. In addition, the query words a user enters carry an emphasis and implicitly reflect the user's needs and personal preferences, yet most retrieval models only compute simple string statistics over the text and ignore the internal structure of the language. Such models can cause large deviations in the relevance of the results returned by the search engine, and the deviation is more pronounced in vertical search engines such as those for e-commerce;

second, the prior art mostly relies on bag of words, a simple text representation in which a text is treated as an unordered set of words and the overall syntax and context of the text are not reflected, so its expressive power is very limited. The prior art cannot break the constraints of the bag-of-words model, analyze the internal structure of the text, or build a general retrieval model that handles both structured and unstructured data. As a result the user's search intention cannot be identified accurately, the search experience does not meet users' need to obtain information, users' enthusiasm for searching and overall user stickiness suffer, and the healthy development of the website platform is hindered;

third, the prior art has the following difficulties and shortcomings in identifying query core words: first, a query text is short, so the task is entity recognition at the sentence level, whereas traditional named entity recognition focuses on analysis at the document level; existing text analysis techniques (such as lexical and syntactic analysis) therefore perform poorly at recognizing query core words; second, query texts are loosely structured and contain many non-standard expressions, which makes generalizing and standardizing the data difficult; third, conventional named entity recognition identifies specific entities in a specific text, whereas in a query only one core word reflects the user's search purpose, so the context information of words must be mined in depth; fourth, named entity recognition only identifies the entities in a text, whereas core word recognition must identify the key component of the query that best reflects the user's search intention and assign it to a specific category;

fourth, the prior art has the following difficulties and shortcomings in extracting structured query information: first, query texts are non-standard, so a great deal of generalization and standardization work is required during semantic role marking; second, query texts are loosely structured: because existing search engines retrieve on a bag-of-words model, users are led to enter queries that are merely a pile of keywords, many queries do not follow syntactic rules or do not even form a complete sentence, and prior-art text analysis techniques are not fully applicable; third, because text expression is diverse, many instances exhibit polysemy (one word with several meanings) or synonymy (several words with one meaning), which makes classifying semantic units very difficult;

fifth, semi-supervised and unsupervised learning methods have been applied to natural language processing in the prior art. The commonly used semi-supervised method is self-training; self-training methods are summaries of practical experience, lack a theoretical basis, and do not perform well on many problems. Moreover, the state sequences of unmarked data are unknown, and a small amount of marked data cannot cover all the latent patterns of the unmarked data. Because large amounts of manually marked data are lacking, the supervised conditional random field (CRF) models of the prior art cannot solve the problem of marking search semantic roles well.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides the structural query information expression method for marking the semantic role of search, which deeply analyzes the search query input by the user, analyzes the search intention of the user and more effectively and accurately feeds back the information required by the user. The natural language text input by the user is expressed in a structured form, so that the search intention and personal preference of the user can be deeply analyzed, the number of times of query input by the user is reduced, and the search path of the user is shortened.

In order to achieve the technical effects, the technical scheme adopted by the invention is as follows:

in the structured query information expression method for marking search semantic roles, structured query information is extracted from the search query input by the user, the natural language text is expressed as structured data, the user's search intention is analyzed accurately, and the user's search satisfaction is improved. Based on the latent semantic structure of the query, the invention gives a formal representation of structured query information extraction, proposes the concept of marking search semantic roles, and gives its complete definition: a search query input by a user is represented in a structured data format governed by a core word, and the core word and the semantic arguments it governs are marked in the search query;

the method mainly comprises an architecture for marking search semantic roles, query core word identification based on a semi-supervised conditional random field, and structured query information extraction based on a semi-supervised conditional random field, specifically as follows:

first, the architecture for marking search semantic roles analyzes the search query input by the user in depth, divides it into several independent semantic units, and assigns the semantic units to preset semantic categories. Marking is realized progressively in two parts: first, the key component of the user's query, namely the core word, is identified; the core word directly represents the user's real query intention, and when a deep structural analysis of the search query is not possible, the core word keeps relevance within a controllable range; second, the search query is analyzed at a deeper level, structured information is extracted from it, and the user's real search intention and potential needs are identified;

second, in query core word identification based on the semi-supervised conditional random field, a model is built for the search query input by the user, and the core word in the search query is identified and classified. The generation process of a query is derived from a probabilistic point of view and modeled with a three-layer Bayesian semi-supervised probability model, in which the core word of the search query is regarded as a text, the context information of the core word as the words making up that text, and the category of the core word as the topic; the core words are mined and classified with the semi-supervised conditional random field model;

third, in structured query information extraction based on the semi-supervised conditional random field, a semi-supervised conditional random field model is used to extract structured query information and to express the natural language text input by the user as structured query data. A semi-automatic marking method is first provided to pre-mark a large number of queries; a small amount of manually marked data and a large amount of semi-automatically marked data are then used together to train the model with the semi-supervised conditional random field method, and the trained model extracts structured query information from unmarked data.

In the structured query information expression method for marking search semantic roles, further, semantic role labeling is a method that marks the predicate of a sentence and the other components governed by that predicate, analyzing the structure of the sentence in depth so as to analyze it at the semantic level; that is, it identifies the predicate of a sentence and the semantic arguments governed by the predicate. Marking search semantic roles automatically marks each semantic role in a search query and analyzes the structure of the query so as to understand the user's search intention in depth; the query is governed by the core word, and the other components of the query are subordinate to the core word;

marking search semantic roles is defined as follows: a search query input by a user is represented in a structured data format governed by a core word, and the core word and the other semantic arguments it governs are marked in the search query. The formal definition is:

$$p \rightarrow \{\mathrm{ProWord};\ \mathrm{SeUnit}_1, \mathrm{SeUnit}_2, \ldots, \mathrm{SeUnit}_n\}$$

where $p$ represents the search query entered by the user, $\mathrm{ProWord}$ represents the core word in the query, $\mathrm{SeUnit}_i$ represents a semantic unit to be marked, and $n$ represents the number of defined semantic units.
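As an illustration of this definition, the following minimal Python sketch stores a query p together with its core word and marked semantic units. The class name, field names, the example query and the category names (color, brand, size) are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StructuredQuery:
    """p -> {ProWord; SeUnit_1, ..., SeUnit_n} held as a plain record."""
    raw_query: str                      # p: the query as typed by the user
    core_word: str                      # ProWord: the governing core word
    # Marked semantic units, stored as category -> text (illustrative layout)
    semantic_units: Dict[str, str] = field(default_factory=dict)


# Hypothetical example; the categories below are assumptions for illustration.
example = StructuredQuery(
    raw_query="red nike running shoes size 42",
    core_word="running shoes",
    semantic_units={"color": "red", "brand": "nike", "size": "42"},
)
```

A record of this kind could then be matched field by field against structured commodity data, which is the use the description has in mind for structured search.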

In the structured query information expression method for marking search semantic roles, further, a topic model is introduced: when judging text relevance, not only the co-occurrence of words but also the deeper semantics expressed by the text is considered. The invention introduces a topic model for semantic analysis. A topic in the topic model is a group of generalized expressions of the same concept, and the generation of a text is explained with a generative model: a text contains several topics, each topic selects words according to a probability, and the generation process of a text is expressed as:

$$q(\text{word} \mid \text{text}) = \sum_{\text{topic}} q(\text{word} \mid \text{topic}) \times q(\text{topic} \mid \text{text})$$

Matrix form of the topic model: the matrix on the left of the equation gives the frequency of each word in each text, that is, the probability that the word occurs; the first matrix on the right gives the probability of each word under each topic; the second matrix on the right gives the probability of each topic in each text. Given a collection of texts, the texts are first preprocessed, the frequency of each word in each text is counted to obtain the text-word matrix on the left, and the topic model then factorizes the left matrix to learn the two matrices on the right;

the conditional random field topic model models the topics implied by the words and groups together, out of a massive body of texts, the texts that express the same semantic topic. A three-layer Bayesian semi-supervised probability model is used to identify query core words: the core word corresponds to the text, the context information of the core word corresponds to the words of the text, and the category of the core word corresponds to the topic.
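The matrix form described above can be checked numerically. The following is a minimal sketch in plain Python, assuming two topics, three words and two texts; all probabilities are made-up illustrative values. It shows that multiplying the word-topic matrix by the topic-text matrix yields a valid word-text distribution, which is the mixture formula written as a matrix product.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Rows: words, columns: topics. Each column sums to 1: q(word | topic).
word_given_topic = [
    [0.5, 0.0],   # "shoes"
    [0.5, 0.1],   # "running"
    [0.0, 0.9],   # "phone"
]

# Rows: topics, columns: texts. Each column sums to 1: q(topic | text).
topic_given_text = [
    [1.0, 0.5],
    [0.0, 0.5],
]

# q(word | text) = sum over topics of q(word | topic) * q(topic | text)
word_given_text = matmul(word_given_topic, topic_given_text)

# Each column of the product is again a probability distribution over words.
column_sums = [sum(row[j] for row in word_given_text) for j in range(2)]
```

A topic model works in the opposite direction: it starts from the observed left-hand matrix and learns the two factors.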

In the structured query information expression method for marking search semantic roles, further, the topic model is derived as follows: a query containing a single core word is formally represented as a triple (q, r, s), where q (the ProWord) is the core word of the query, r is the context information of the core word in the query, and s is the category of the core word. The goal of core word identification is to identify the core word q of the query and assign q to its most likely category s; the problem thus becomes finding the most probable triple (q, r, s)^* among all possible triples:

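$$\begin{aligned}
(q,r,s)^{*} &= \arg\max_{(q,r,s)} \mathrm{Qr}(p,q,r,s) \\
&= \arg\max_{(q,r,s)} \mathrm{Qr}(p \mid q,r,s)\,\mathrm{Qr}(q,r,s) \\
&= \arg\max_{(q,r,s)\in F(p)} \mathrm{Qr}(q,r,s)
\end{aligned}$$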

The conditional probability Qr(p | q, r, s) is the probability that the triple (q, r, s) generates the query p. A given triple generates a unique query, so for a given query p and triple (q, r, s), Qr(p | q, r, s) can only be 0 or 1: either the triple generates the query p or it does not. F(p) is defined as the set of all triples that can generate the query p, that is, those with Qr(p | q, r, s) = 1, so (q, r, s)^* must lie in F(p). The core word identification problem therefore reduces to computing the joint probability Qr(q, r, s) of each triple in F(p):

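$$\begin{aligned}
\mathrm{Qr}(q,r,s) &= \mathrm{Qr}(q)\,\mathrm{Qr}(s \mid q)\,\mathrm{Qr}(r \mid q,s) \\
&= \mathrm{Qr}(q)\,\mathrm{Qr}(s \mid q)\,\mathrm{Qr}(r \mid s)
\end{aligned}$$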

The second line assumes that Qr(r | q, s) = Qr(r | s), that is, that the context depends only on the category of the core word. The core word identification problem thus reduces to estimating Qr(q), Qr(s | q) and Qr(r | s) from data containing a large number of core words and their context information.

In the structured query information expression method for marking search semantic roles, further, the semi-supervised conditional random field model is as follows: let the data set be $R = \{(q_i, r_i, s_i) \mid i = 1, \ldots, N\}$, where $(q_i, r_i, s_i)$ is the triple corresponding to a query $p$ and $N$ is the size of the data set. The core word identification problem in queries is formally expressed as:

$$\max \prod_{i=1}^{N} \mathrm{Qr}(q_i, r_i, s_i)$$

If each core word belongs to a single category, the optimization target is constructed according to the above formula. When a core word may belong to several categories, the data set is constructed as $R = \{(q_i, r_i)\}$ and the category information $s_i$ of the core word is treated as a hidden variable; the optimization goal then becomes:

$$\max \prod_{i=1}^{N} \mathrm{Qr}(q_i, r_i) = \max \prod_{i=1}^{N} \mathrm{Qr}(q_i) \sum_{s} \mathrm{Qr}(s \mid q_i)\,\mathrm{Qr}(r_i \mid s)$$

where $\mathrm{Qr}(q_i)$ is the probability that the core word $q_i$ occurs, $\mathrm{Qr}(s \mid q_i)$ is the probability that $q_i$ belongs to category $s$, and $\mathrm{Qr}(r_i \mid s)$ is the probability that the context information $r_i$ occurs under category $s$. The probability $\mathrm{Qr}(q_i)$ is independent of $\mathrm{Qr}(s \mid q_i)$ and $\mathrm{Qr}(r_i \mid s)$ and is obtained statistically from the data set. Writing its estimate as $\mathrm{Qr}^{*}(q_i)$, the above formula becomes:

$$\max \prod_{i=1}^{N} \mathrm{Qr}^{*}(q_i) \sum_{s} \mathrm{Qr}(s \mid q_i)\,\mathrm{Qr}(r_i \mid s)$$

Solving the problem thus becomes a probability estimation problem for the above formula, which has the form of a topic model: the core word corresponds to the text, the context information of the core word to the words of the text, and the category information to the topic. The invention adopts a conditional random field topic model learned in a semi-supervised manner, referred to in the invention as SS-LDA: the topics (categories) are agreed in advance, and the topic (category) of each text (core word) is marked in the training data set.

In the structured query information expression method for marking search semantic roles, further, a query core word identification system is constructed from SS-LDA and the training data set, comprising three modules: a data preprocessing module, an offline training module and an online marking module;

data preprocessing: the search query input by the user is normalized and standardized. Normalization filters the query, removing garbled characters, redundant spaces and tab characters, and removes stop words, which makes subsequent processing of the query words easier; standardization applies stemming to restore each query word to its base form;

offline training: parameters are solved by data mining and parameter learning. Core words are first selected from the training data set as seeds and marked with their category information; the data set is then scanned with the seed core words to obtain the training data (q_i, r_i). A topic model is trained with SS-LDA, giving an estimate of Qr(s | q) for each seed core word and Qr(r | s) for each category. The data set is scanned again to obtain all queries containing a learned context r, and the part of each query that remains after the context r is removed is taken as a new core word; for each newly extracted core word, Qr(s | q) is estimated by applying SS-LDA again. In this step the probability Qr(q) of each newly extracted core word q is also updated; Qr(q) is estimated from the frequency of q in the data set, so the more frequent the core word q, the larger Qr(q). These steps yield the Qr(q), Qr(s | q) and Qr(r | s) required by the model, and the probabilities obtained offline are stored so that prediction can be done efficiently online;

online marking: for a search query p input by the user, the triple (q, r, s) with the largest probability in F(p) is sought. The query is segmented into all combinations of core word and context information, and each candidate core word is marked with its possible categories to generate F(p); the joint probability Qr(q, r, s) of every triple in F(p) is computed, and the triple with the largest probability is output as the result.
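The preprocessing and online marking modules can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the probability tables stand in for the values learned offline and are invented for the example, the stop word list and the plural-stripping stemmer are deliberately crude, and the "#" placeholder marking the core word position in a context is a convention chosen here.

```python
import re
from typing import List, Optional, Tuple

STOP_WORDS = {"the", "a", "an", "for", "of"}   # illustrative list only

# Hypothetical offline estimates of Qr(q), Qr(s | q) and Qr(r | s).
P_CORE = {"running shoe": 0.4, "shoe": 0.6}
P_CLASS_GIVEN_CORE = {
    "running shoe": {"sportswear": 0.7, "casual": 0.3},
    "shoe": {"casual": 1.0},
}
P_CONTEXT_GIVEN_CLASS = {
    "sportswear": {"cheap #": 0.2},
    "casual": {"cheap #": 0.1, "cheap running #": 0.01},
}


def naive_stem(word: str) -> str:
    """Strip a plural 's'; a real system would use a proper stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(query: str) -> List[str]:
    """Lowercase, drop non-word characters, extra spaces, tabs and stop words."""
    cleaned = re.sub(r"[^\w\s]", " ", query.lower())
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


def candidate_triples(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Build F(p): every (core word, context, category) that regenerates p."""
    triples = []
    for start in range(len(tokens)):
        for end in range(start + 1, len(tokens) + 1):
            core = " ".join(tokens[start:end])
            if core not in P_CLASS_GIVEN_CORE:
                continue
            context = " ".join(tokens[:start] + ["#"] + tokens[end:])
            for category in P_CLASS_GIVEN_CORE[core]:
                triples.append((core, context, category))
    return triples


def joint(core: str, context: str, category: str) -> float:
    """Qr(q, r, s) = Qr(q) * Qr(s | q) * Qr(r | s)."""
    return (P_CORE.get(core, 0.0)
            * P_CLASS_GIVEN_CORE[core].get(category, 0.0)
            * P_CONTEXT_GIVEN_CLASS.get(category, {}).get(context, 0.0))


def tag_query(query: str) -> Optional[Tuple[str, str, str]]:
    """Return the most probable triple in F(p), or None if F(p) is empty."""
    triples = candidate_triples(preprocess(query))
    if not triples:
        return None
    return max(triples, key=lambda t: joint(*t))


result = tag_query("  Cheap\trunning   shoes!! ")
```

With these made-up tables, the query is split into the core word "running shoe" and the context "cheap #", and the category "sportswear" wins because its joint probability is the largest of the three candidates.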

In the structured query information expression method for marking search semantic roles, further, the labeling problem is formally expressed as follows: the input of a labeling problem is a known observation sequence and the output is a hidden label sequence (state sequence). A model is learned from training samples so that it assigns the correct label sequence to a new observation sequence. The labeling problem is divided into two processes, learning and labeling. First a training data set is given:

$$R = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$$

where $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})$, $i = 1, 2, \ldots, N$, is an observation sequence, $y_i = (y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(n)})$ is the corresponding label sequence (state sequence), $N$ is the number of training samples and $n$ is the length of the observation sequence. The learning system learns a model from the training data set, and the model is represented by the conditional probability distribution:

$$Q(Y^{(1)}, Y^{(2)}, \ldots, Y^{(n)} \mid X^{(1)}, X^{(2)}, \ldots, X^{(n)})$$

where each $X^{(i)}$ ($i = 1, 2, \ldots, n$) ranges over all possible observations and each $Y^{(i)}$ ($i = 1, 2, \ldots, n$) ranges over all possible labels. For a new input observation sequence, the labeling system finds the corresponding state sequence as output according to the learned conditional probability distribution. Specifically, for an observation sequence

$$x = (x^{(1)}, x^{(2)}, \ldots, x^{(n)})$$

it finds the label sequence that maximizes the conditional probability

$$Q(y^{(1)}, y^{(2)}, \ldots, y^{(n)} \mid x^{(1)}, x^{(2)}, \ldots, x^{(n)})$$

namely

$$y = (y^{(1)}, y^{(2)}, \ldots, y^{(n)})$$

Marking search semantic roles is a typical labeling problem and is solved with a sequence labeling model; specifically, the invention marks search semantic roles with a semi-supervised conditional random field model.

In the structured query information expression method for marking search semantic roles, further, the conditional random field is as follows: a conditional random field is the Markov random field of a random variable Y given a random variable X. A linear-chain conditional random field is a conditional probability model Q(Y | X) of the label sequence given the observation sequence, where Y is the output variable and represents the label sequence, and X is the input variable and represents the observation sequence to be labeled. In the learning process, the conditional probability model Q^*(Y | X) is obtained from the training data set by maximum likelihood estimation or regularized maximum likelihood estimation; in the prediction process, for a given observation sequence x, the state sequence y^* that maximizes the conditional probability Q^*(y | x) is found with the learned model;

Conditional random field definition: let X and Y be random variables and Q(Y | X) the conditional probability distribution of Y given X. If the random variable Y constitutes a Markov random field represented by an undirected graph F = (U, B), that is, if

$$Q(Y_u \mid X, Y_k, k \neq u) = Q(Y_u \mid X, Y_k, k \sim u)$$

holds for every node u, then the conditional probability distribution Q(Y | X) is called a conditional random field. Here k ~ u denotes all nodes k connected to node u by an edge in the graph F = (U, B), k ≠ u denotes all nodes other than u, and $Y_u$ and $Y_k$ are the random variables corresponding to nodes u and k. Assuming that X and Y have the same graph structure, the linear-chain case of the undirected graph is:

$$F = \bigl(U = \{1, 2, \ldots, n\},\ B = \{(i, i+1)\}\bigr)$$

where $i = 1, 2, \ldots, n-1$, $X = (X_1, X_2, \ldots, X_n)$, $Y = (Y_1, Y_2, \ldots, Y_n)$, and each maximal clique is a set of two adjacent nodes.

In the structured query information expression method for marking search semantic roles, further, a sequence labeling model is established: the search query input by the user is segmented into semantic units and each semantic unit is assigned to a preset category; this problem of marking search semantic roles is solved with a sequence labeling model;

the input of the framework for marking search semantic roles comprises two types of data: a small amount of manually marked data and a large amount of semi-automatically marked data. The semantic tagger is obtained by training on these two types of resources. The n marked training examples are denoted $(x^{(i)}, y^{(i)})$, $i = 1, 2, \ldots, n$, where $x^{(i)}$ is an observation sequence and $y^{(i)}$ is a label sequence, and $q(\cdot)$ is a probability function. The goal of model training is to find the optimal parameter vector $h^{*}$ satisfying:

$$h^{*} = \arg\max_{h} \sum_{i=1}^{n} \log q(y^{(i)} \mid x^{(i)}; h)$$

After model training is finished, the semantic tagger is obtained; for a given input sequence x it yields the corresponding output sequence $y^{*}$:

$$y^{*} = \arg\max_{y} q(y \mid x; h^{*})$$

Training samples are input into the semantic tagger model, which outputs the marked search semantic role framework.
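For a linear-chain model, the argmax over label sequences is commonly computed with the Viterbi algorithm. The patent does not name a decoding algorithm, so the following is only a sketch of one standard choice; the label set and the emission and transition weights are hypothetical stand-ins for a trained parameter vector. Because the conditional probability of a linear-chain model is proportional to the exponential of the summed scores, maximizing the total score maximizes the probability.

```python
from typing import Dict, List, Tuple

LABELS = ["Brand", "Product", "Other"]   # hypothetical semantic categories

# Hypothetical learned weights: (token, label) and (previous label, label).
EMIT: Dict[Tuple[str, str], float] = {
    ("nike", "Brand"): 2.0,
    ("shoes", "Product"): 2.0,
    ("red", "Other"): 1.0,
}
TRANS: Dict[Tuple[str, str], float] = {
    ("Brand", "Product"): 1.0,
    ("Other", "Brand"): 0.5,
}


def viterbi(tokens: List[str]) -> List[str]:
    """Return the label sequence with the highest total score."""
    if not tokens:
        return []
    # best[i][label] = (score of the best path ending in label at i, backpointer)
    best = [{lab: (EMIT.get((tokens[0], lab), 0.0), None) for lab in LABELS}]
    for i in range(1, len(tokens)):
        column = {}
        for lab in LABELS:
            prev, score = max(
                ((p, best[i - 1][p][0] + TRANS.get((p, lab), 0.0))
                 for p in LABELS),
                key=lambda item: item[1],
            )
            column[lab] = (score + EMIT.get((tokens[i], lab), 0.0), prev)
        best.append(column)
    # Pick the best final label, then follow the backpointers to the start.
    label = max(LABELS, key=lambda lab: best[-1][lab][0])
    path = [label]
    for i in range(len(tokens) - 1, 0, -1):
        label = best[i][label][1]
        path.append(label)
    return path[::-1]


decoded = viterbi(["red", "nike", "shoes"])
```

Dynamic programming keeps the cost linear in the query length and quadratic in the number of labels, instead of enumerating every possible label sequence.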

In the structured query information expression method for marking search semantic roles, further, the structured query information extraction model is as follows: units in the query are pre-marked with a semi-automatic marking method so that the conditional random field model can be trained with more latent information; the data set is marked semi-automatically using relational data tables, and the semi-automatic marking of training data is completed using users' click log information;

the invention uses the following definitions:

first, the manually marked data set: the training data marked manually is formalized as $y = (y_1, y_2, \ldots, y_R)$;

second, the semi-automatically marked data set: the training data marked by the semi-automatic marking method using additional resources is the set of marked units in the query, formalized as $z = (z_1, z_2, \ldots, z_R)$;

The semi-automatically marked data set supplements the manually marked data set and addresses the problem that manually marked data cannot cover all patterns of the unmarked data. Two types of data sets are used to learn the conditional random field model: a small manually marked data set and a large semi-automatically marked data set. In the semi-automatically marked data set only some of the semantic units are marked, and the following assumption is made: if $y_r = z_r$, the variable is treated as an observed variable; otherwise it is treated as a hidden variable.
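The semi-automatic marking step can be illustrated with a small sketch. It assumes that the relational data table is reduced to a lookup from attribute value to category, and it interprets the assumption above as follows: positions where the table supplies a label are treated as observed, and the remaining positions are left hidden for the model to infer. The table contents and function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical relational table flattened to: attribute value -> category.
PRODUCT_TABLE: Dict[str, str] = {
    "nike": "Brand",
    "adidas": "Brand",
    "red": "Color",
}


def pre_label(tokens: List[str]) -> List[Optional[str]]:
    """Semi-automatic labels z: a category where the table matches, else None."""
    return [PRODUCT_TABLE.get(token) for token in tokens]


def split_observed_hidden(
    z: List[Optional[str]],
) -> Tuple[Dict[int, str], List[int]]:
    """Separate positions with a derived label (observed) from the rest (hidden)."""
    observed = {i: label for i, label in enumerate(z) if label is not None}
    hidden = [i for i, label in enumerate(z) if label is None]
    return observed, hidden


z = pre_label(["red", "nike", "running", "shoes"])
observed, hidden = split_observed_hidden(z)
```

In this example the first two positions are fixed to "Color" and "Brand", while the labels of "running" and "shoes" remain hidden variables during training.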

Compared with the prior art, the invention has the following contributions and innovation points:

first, the main contributions of the structured query information expression method for marking search semantic roles are as follows: first, from the perspective of computational linguistics, the invention treats the search query input by the user as a word sequence and builds a model over the sequence to analyze user behavior, drawing on computational science, cognitive science and psychology to infer the real search intention behind that behavior; second, the search query is analyzed structurally as a whole, and a method is provided for expressing the natural language text input by the user as structured query information, which is a successful practice in structured query information extraction and structure prediction and can be extended to other fields such as natural language processing and data mining; third, the method is based on semi-supervised learning and combines machine learning with human experience: compared with supervised learning it reduces the cost of manually marking a large number of samples, and compared with unsupervised learning, whose interpretability is poor, it gives a more reasonable explanation of the result set; fourth, the method helps the search engine analyze the user's search intention, thereby improving the user's search experience and the conversion rate of commodities and realizing structured search, and so has high practical value and broad application prospects;

secondly, the invention belongs to the hotspot problem of user search intention identification, realizes the deep analysis of the search query input by the user, analyzes the search intention of the user, and feeds back the information required by the user more effectively and accurately. The natural language text input by the user is expressed in a structured form, so that the search intention and personal preference of the user can be deeply analyzed, the number of times of query input by the user is reduced, and the search path of the user is shortened;

thirdly, the method for marking search semantic roles has great practical significance: on one hand, it provides valuable query semantic information for a shopping search engine and important parameters for retrieval and ranking, which helps improve the user's search experience and the conversion rate of commodities; on the other hand, marking search semantic roles can effectively increase page advertising revenue. The innovativeness of the invention is further embodied in the following aspects: firstly, a formalized representation of the tag search semantic role is provided so that the problem can be modeled mathematically; secondly, the generation process of a query statement is derived from a probabilistic perspective, the problem is formally represented as an optimization problem, a three-layer Bayes semi-supervised probability model is established for it, and the effectiveness of the method is verified by experiments; thirdly, to relieve the shortage of marked samples, the invention provides semi-automatically marked data: a large number of queries are pre-marked using background structured data, then a small number of manually marked samples and a large amount of semi-automatically marked data are combined, a semi-supervised conditional random domain model is used to extract structured query information, and the natural language text input by a user is expressed as structured data;

fourthly, the invention carries out deep analysis on the search query input by the user, divides the search query input by the user into a plurality of independent semantic units and distributes the semantic units to preset semantic categories, and adopts a progressive mode to realize the method for marking the semantic role of the search, which comprises two parts: firstly, identifying key components, namely core words, of a user input query, wherein the key components directly represent the real search query intention of the user, and when deep structural analysis on the search query cannot be carried out, the core words can ensure that the relevance is in a controllable range; secondly, deep-level analysis is carried out on the search query input by the user, structured information is extracted from the search query input by the user, and the real search intention and potential requirements of the user are identified;

fifthly, the invention further discusses the problem of marking the semantic role of search, extracts the structured query information from the search query input by the user and marks the semantic role of the structured query information. The background of the vertical website is some semi-structured information, when a user inputs a query, the structured query information is extracted, and the structured information is matched with the background information, so that the search experience of the user is improved. Aiming at the problems that a large amount of manpower is consumed for manual marking of data in the prior art, and data marking is possibly inconsistent, the invention adopts a semi-supervised condition random domain model to solve the problems, and the main contributions are as follows: firstly, a semi-automatic marking method for inquiring based on a user click log and a domain knowledge base is provided, and secondly, a semantic role marking and searching method based on a semi-supervised condition random domain is provided;

sixth, the invention discloses a semi-automatic marking method for pre-marking the search query input by the user, then comprehensively adopting manual marking data and semi-automatic marking data, adopting a semi-supervised condition random domain method training model for carrying out structured query information extraction on unlabelled data, and adopting a small amount of manually marked data and a large amount of semi-automatically marked data for training the semi-supervised condition random domain model, thereby relieving the difficulty of manually marking the data and verifying the superiority of the invention through experiments.

Drawings

FIG. 1 is a probability map model diagram of the conditional random domain model of the present invention.

FIG. 2 is a schematic diagram of three representation levels of a text generated by a conditional random field according to the present invention.

FIG. 3 is a flow chart of a query core word recognition method according to the present invention.

FIG. 4 is a schematic diagram of a query semantic role tagging framework of the present invention.

FIG. 5 is a flow chart of the method for extracting the structured query information according to the present invention.

Detailed Description

The technical solution of the method for expressing structured query information of a tagged search semantic role according to the present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand the present invention and can implement the same.

In internet search engines, user-entered search queries are often directed to structured data such as e-commerce platform merchandise searches, flights, movie showtimes, and the like. But since the search query entered by the user is represented in the form of natural language text, it is difficult to return relevant results from such structured data. If structured query information can be extracted from the search query input by the user, the natural language text is expressed into structured data, so that the search intention of the user can be analyzed more accurately, and the search satisfaction of the user is improved. The invention is based on the potential semantic structure of the query and carries out formalized representation on the extraction of the search structural query information, and provides the concept of marking the search semantic role and gives the complete definition: the method comprises the steps of representing a search query input by a user into a structured data format governed by core words, and marking out the core words and semantic arguments governed by the core words in the search query.

The semantic role of label search analyzes the search query input by the user from the structural characteristics of the sentence, understands and grasps the search intention of the user, and mainly comprises the following steps:

firstly, based on semantic role marks, providing complete definitions of mark search semantic roles and research ranges thereof;

secondly, addressing the problem of identifying the core words of the search query: a model is established for the search query input by the user, and the core words in the search query are identified and classified; the generation process of a query sentence is derived from a probabilistic perspective, and the model is established with a three-layer Bayes semi-supervised probability model in which the core words in the search query are regarded as texts, the context information of the core words as the words forming the texts, and the category of the core words as the topic;

thirdly, extracting structured query information by adopting a semi-supervised conditional random domain model, and expressing a natural language text input by a user into structured query data;

fourthly, comparing various conditions of different models and different feature spaces on the real data set, and the results and the analysis thereof prove the superiority of the method in the structural query information expression of the mark search semantic role.

First, framework for marking search semantic roles

The invention relates to a method for marking a search semantic role, which is a basis of structured search, deeply analyzes a search query input by a user, divides the search query input by the user into a plurality of independent semantic units and distributes the semantic units to preset semantic categories, and adopts a progressive mode to realize the method for marking the search semantic role, and comprises two parts: firstly, identifying key components, namely core words, of a user input query, wherein the key components directly represent the real search query intention of the user, and when deep structural analysis on the search query cannot be carried out, the core words can ensure that the relevance is in a controllable range; and secondly, deep-level analysis is carried out on the search query input by the user, structured information is extracted from the search query input by the user, and the real search intention and potential requirements of the user are identified.

(I) Concept of tag search semantic role

Semantic role tagging is a method of semantic analysis that deeply parses the structure of a sentence: it tags the predicate in a sentence and the other components (subject, object, and the like) governed by the predicate. Tag search semantic role automatically tags the semantic roles in a search query and analyzes the structure of the query in order to parse the user's search intention in depth. A search query is a stack of keywords and does not include predicates or other language components; instead it is governed by core words, and the other components in the query are subordinate to the core words.

Based on the concept of semantic role marking, the definition of marking and searching semantic roles is as follows: representing a search query input by a user into a structured data format governed by core words, marking the core words and other semantic arguments governed by the core words in the search query, wherein the formalization definition is as follows:

p→{ProWord；SeUnit₁,SeUnit₂,…,SeUnit_n}

(II) identifying query core words

Identifying the query core words means identifying, from the search query input by the user, the entity that best represents the user's search intention. From a linguistic perspective, the core word is the most important semantic unit of the search query input by the user. Core word identification comprises two aspects: firstly, extracting the core words from the search query input by the user; secondly, classifying the core words, namely assigning the extracted core words to specific categories. The recognition of core words in a search query is expressed in a formal language as:

inquiry→{ProWord,Class_ProWord}

wherein ProWord represents the core word that can represent the user's search intention in the query, and Class_ProWord represents the category corresponding to the core word ProWord.

The prior art has the following difficulties and disadvantages in identifying query core words: firstly, the query text is short and the task is entity recognition at the sentence level, whereas traditional named entity recognition focuses on analysis at the document level, so prior-art text analysis techniques (such as lexical analysis and syntactic analysis) do not perform well in recognizing query core words; secondly, the structure of the query text is not strict and contains a large number of non-standard expressions, which makes generalization and standardization of the data difficult; thirdly, conventional named entity recognition identifies specific entities in a specific text, whereas in a query only one core word reflects the user's search purpose, so the context information of the words needs to be mined in depth; fourthly, named entity recognition only recognizes the entities in the text, whereas core word recognition in a query must recognize the key component that reflects the user's search intention and assign it to a specific category. The invention adopts a three-layer Bayes semi-supervised probability model to identify and classify the core words in the query: the core words in the query correspond to texts, the context information of the query corresponds to the words of the texts, and the category of the core words corresponds to the topic.

(III) extracting structured query information

Extracting structured query information means representing the natural language text input by a user in a structured data format. A semantic unit is defined as a character sequence made up of one or more words. Assume that the structured data is stored in the form of a table R＝{R₁,R₂,…,R_n}, where R_i represents a set of semantic categories and their associated attributes. The semantic categories of table R are represented as R.D＝{R.D₁,R.D₂,…,R.D_m}, where m denotes the predefined number of semantic categories to be tagged, and the set of elements of each semantic category is denoted R.D.U＝{R.D.U₁,R.D.U₂,…,R.D.U_k}, where k represents the number of elements in the semantic category. The elements of a semantic category are of character type or numerical type.

Formalized representation of structured query information extraction: a search query sequence p input by the user is represented as a two-tuple <ProWord, R_ProWord>, where ProWord represents the core word in the query p and R_ProWord represents the structured query information table corresponding to ProWord.
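The two-tuple representation can be illustrated with a small data structure. This is a minimal sketch, not the patented implementation: the class name `StructuredQuery`, its field names, and the example query are hypothetical, and the category names are borrowed from the Make, Colour, Style and Product categories used in the e-commerce embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Elements of a semantic category are character type or numerical type.
Value = Union[str, float]


@dataclass
class StructuredQuery:
    """A query p expressed as the two-tuple <ProWord, R_ProWord>."""
    pro_word: str                      # core word of the query
    pro_word_class: str                # Class_ProWord, category of the core word
    table: Dict[str, List[Value]] = field(default_factory=dict)  # R_ProWord

    def add_unit(self, category: str, unit: Value) -> None:
        """Assign one semantic unit to a predefined semantic category."""
        self.table.setdefault(category, []).append(unit)


# Hypothetical example query: "red nike running shoes"
sq = StructuredQuery(pro_word="shoes", pro_word_class="Product")
sq.add_unit("Colour", "red")
sq.add_unit("Make", "nike")
sq.add_unit("Style", "running")
print(sq.pro_word, sq.table)
```

Keeping the core word separate from the table mirrors the definition above: the core word governs the query, and the remaining semantic units are stored under their categories.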

The prior art has the following difficulties and disadvantages in extracting structured query information:

firstly, the query text is not standard, and a large amount of generalization and standardization work is required in the semantic role marking process; secondly, the structure of the query text is not strict: existing search engines retrieve on the basis of a bag-of-words model, which leads users to enter queries as a pile of keywords, so many queries do not follow syntactic rules or do not even form a sentence, and prior-art text analysis techniques are not fully applicable; thirdly, owing to the diversity of textual expression, many examples exhibit polysemy (one word with several meanings) or synonymy (several words with one meaning), which makes the classification of semantic units very difficult. The invention adopts a semi-supervised conditional random domain model to mark the semantic units in a query, and trains the conditional random domain model with a small amount of manually marked data and a large amount of semi-automatically marked data (the invention provides a semi-automatic data marking algorithm).

Second, query core word recognition based on semi-supervised conditional random domain

The invention provides a core word recognition algorithm for search queries, which recognizes and classifies the core words of a given query. In the operation of e-commerce websites such as Taobao and Jingdong, the identification of core words in queries mostly depends on rules. Generating the rules requires experienced professionals, the labor cost of identifying core words with rules and a lexicon is high, the approach is strongly domain-specific, the coverage of the recognition results is low, and it cannot adapt to complex queries. To address this problem, the invention adopts a core word recognition algorithm based on SS-LDA, in which the context information of the core word corresponds to the vocabulary of a text and the category of the core word is taken as the topic, and training is then carried out on the basis of a conditional random domain model.

(I) Introducing a topic model

When judging the text relevance, not only the co-occurrence condition of words but also the deep level semantics expressed by the text are considered, the invention introduces a theme model for semantic analysis, the theme in the theme model is expressed as a group of generalized expression forms with the same concept, and the problem to be solved comprises the following steps: one is how to generate the topics and the other is how to analyze the topics of the articles. The generation process of the text is illustrated by a generation model: a text contains a plurality of topics, each topic selects a plurality of vocabularies according to probability, and the generation process of the text is represented as follows:

q(word | text) ＝ Σ_topic q(word | topic) × q(topic | text)

Matrix form of topic model: wherein the matrix on the left of the equation represents the word frequency of each word in each text, i.e., the probability of occurrence of a word; the first matrix on the right side of the equation represents the occurrence probability of each word in each topic; the second matrix to the right of the equation represents the probability of the occurrence of a different topic in each text. Given a series of texts, preprocessing is performed on the texts in advance, and then the frequency of word occurrence in each text is counted to obtain a text-word matrix on the left side. The topic model is to decompose the matrix on the left side and learn two matrices on the right side.
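The matrix form can be checked numerically. The sketch below uses made-up probabilities for three words, two topics and two texts, and stores words as rows and texts as columns (an arbitrary orientation chosen for illustration); it only verifies that the product of the two right-hand matrices gives a valid word distribution for each text, and does not perform the decomposition itself.

```python
# Rows: words, columns: topics. Each column sums to 1: q(word | topic).
word_given_topic = [
    [0.5, 0.0],  # "shoes"
    [0.5, 0.2],  # "red"
    [0.0, 0.8],  # "phone"
]
# Rows: topics, columns: texts. Each column sums to 1: q(topic | text).
topic_given_text = [
    [1.0, 0.5],
    [0.0, 0.5],
]


def matmul(a, b):
    """Plain matrix product: entry (i, j) sums a[i][k] * b[k][j] over topics k."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Rows: words, columns: texts: q(word | text).
word_given_text = matmul(word_given_topic, topic_given_text)
column_sums = [sum(row[j] for row in word_given_text) for j in range(2)]
print(word_given_text, column_sums)
```

Each column of the result sums to 1, as required for a probability distribution over words within one text.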

The invention adopts a three-layer Bayes semi-supervised probability model to identify and query core words, wherein the core words correspond to the text, the context information of the core words corresponds to words in the text, and the category of the core words corresponds to the theme.

(II) constructing conditional random domain topic model

The conditional random domain topic model involves two conjugate Dirichlet distributions, and the process of generating a text is as follows: firstly, a topic vector a is generated according to a certain probability, where each element of the vector represents the probability of the corresponding topic being selected; then a topic x is selected according to the topic vector a, and a word is generated according to the word probability distribution of the topic x. The probability graph model of the conditional random domain model is shown in FIG. 1, and the joint probability of the conditional random domain is:
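Assuming the model follows the standard latent Dirichlet allocation factorization (an assumption, since the formula is not reproduced in this text), the joint probability for one text of N words can be written with the symbols used for FIG. 2, where b and c are the corpus-level parameters, a is the topic vector of the text, x_n is the topic of the n-th word and k_n is the n-th word:

```latex
Q(a, x, k \mid b, c) \;=\; Q(a \mid b) \prod_{n=1}^{N} Q(x_n \mid a)\, Q(k_n \mid x_n, c)
```

The three factors correspond to the three levels of generation: drawing the topic vector, choosing a topic for each word position, and emitting a word from the chosen topic.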

the process of generating a text by the conditional random domain is divided into three layers, and corresponding to the graph, as shown in fig. 2, three representation layers of the conditional random domain model are marked by three different colors:

1) corpus-level (solid line): b and c are corpus-level parameters, which respectively govern the process by which a text generates its topic vector and the process by which each topic selects a word; the two parameters are global and are sampled once during model training;

2) document-level (dotted line): a is a variable at a text level, each text corresponds to different a and represents the theme distribution of the text, the theme x distribution of each text is different, and the generation process of each text needs to sample a once;

3) word-level (five-pointed star): x and k are word level variables, x is generated by a and is simply analyzed as a probability value in a theme vector, k is generated by x and c together, and one word k corresponds to one theme x;

the conditional random domain model is mainly determined by the two parameters b and c, and training the model is the process of solving for these two parameters: k is taken as the observed variable, a and x are taken as hidden variables, and the parameters b and c are learned with the EM (expectation-maximization) algorithm.
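The three-level generation process can be sketched with the standard library alone. This is an illustrative simulation under stated assumptions, not the training procedure: b is taken to be a Dirichlet parameter vector, c a table of per-topic word probabilities, and the vocabulary and all numeric values are invented for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a topic vector from a Dirichlet by normalising Gamma samples."""
    draws = [rng.gammavariate(a_i, 1.0) for a_i in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Pick an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_text(b, c, vocab, length, rng):
    """Generate one text: a at document level, then (x, k) for each word."""
    a = sample_dirichlet(b, rng)            # document level: topic vector a
    topics, words = [], []
    for _ in range(length):
        x = sample_index(a, rng)            # word level: topic x drawn from a
        k = vocab[sample_index(c[x], rng)]  # word level: word k from x and c
        topics.append(x)
        words.append(k)
    return a, topics, words


rng = random.Random(0)
vocab = ["red", "cheap", "leather", "wireless"]   # hypothetical vocabulary
b = [0.5, 0.5]                                    # corpus level (assumed values)
c = [[0.4, 0.1, 0.5, 0.0],                        # word distribution of topic 0
     [0.1, 0.4, 0.0, 0.5]]                        # word distribution of topic 1
a, topics, words = generate_text(b, c, vocab, 5, rng)
print(a, topics, words)
```

In training the direction is reversed: only the words k are observed, and b and c are estimated while a and x remain hidden.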

A three-layer Bayes semi-supervised probability model is adopted to identify the core word in a query. A query containing a single core word is formally represented as a triple (q, r, s), where q represents the core word in the query, r represents the context information of the core word, and s represents the category information of the core word; r can be empty, that is, the core word in the query may have no context information. The problem of identifying the core word in a query is thus converted into the following: given a query, find the triple (q, r, s) that maximizes the joint probability Qr(q, r, s). The invention solves this problem with a semi-supervised conditional random domain topic model.

(III) topic model derivation

A query containing a single core word is formally represented as a triple (q, r, s), where q (ProWord) represents the core word in the query, r represents the context information of the core word in the query, and s represents the category information of the core word. The core word identification problem aims to identify the core word q in the query and assign q to its most probable category s. The problem is thus converted into finding, among all possible triples, the triple (q, r, s)^* with the highest probability:

(q,r,s)^*＝argmax_(q,r,s)Qr(p,q,r,s)

＝argmax_(q,r,s)Qr(p|q,r,s)Qr(q,r,s)

＝argmax_{(q,r,s)∈F(p)}Qr(q,r,s)

The conditional probability Qr(p | q, r, s) represents the probability that the triple (q, r, s) generates the query p. A given triple generates a unique query, so for a given query p and triple (q, r, s), Qr(p | q, r, s) can only be 0 or 1: either the triple generates the query p or it cannot. F(p) is defined as the set of all triples that can generate the query p, that is, those for which Qr(p | q, r, s) ＝ 1, so (q, r, s)^* is certainly in F(p). The core word recognition problem can therefore be simplified to computing the joint probability Qr(q, r, s) for every triple in F(p) and selecting the largest:

Qr(q,r,s)＝Qr(q)Qr(s|q)Qr(r|q,s)

＝Qr(q)Qr(s|q)Qr(r|s)

In the formula, it is assumed that Qr(r | q, s) ＝ Qr(r | s), that is, the context depends only on the category of the core word and not on the core word itself. The core word recognition problem thus further reduces to estimating Qr(q), Qr(s | q) and Qr(r | s) from data of very large scale, containing a large number of core words and a large amount of context information.
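The search over F(p) can be sketched as a brute-force enumeration. The sketch assumes that a core word is a contiguous word span and that the context is the rest of the query with the placeholder `#` standing in for the core word; both conventions, the function names, the category names and every probability value below are illustrative assumptions.

```python
def candidate_triples(query, categories):
    """Enumerate F(p): each contiguous span as q, the rest as r, each class s."""
    words = query.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            q = " ".join(words[i:j])
            r = " ".join(words[:i] + ["#"] + words[j:])  # '#' marks the core word
            for s in categories:
                yield q, r, s


def best_triple(query, q_prob, s_given_q, r_given_s):
    """Return the triple maximising Qr(q) * Qr(s | q) * Qr(r | s)."""
    best, best_score = None, 0.0
    for q, r, s in candidate_triples(query, list(r_given_s)):
        score = (q_prob.get(q, 0.0)
                 * s_given_q.get(q, {}).get(s, 0.0)
                 * r_given_s[s].get(r, 0.0))
        if score > best_score:
            best, best_score = (q, r, s), score
    return best, best_score


# Hypothetical probabilities, as if learned offline.
q_prob = {"running shoes": 0.02, "shoes": 0.05}
s_given_q = {"running shoes": {"Sports": 0.9, "Fashion": 0.1},
             "shoes": {"Sports": 0.4, "Fashion": 0.6}}
r_given_s = {"Sports": {"cheap #": 0.2, "cheap running #": 0.01},
             "Fashion": {"cheap #": 0.1, "cheap running #": 0.02}}

result, score = best_triple("cheap running shoes", q_prob, s_given_q, r_given_s)
print(result, score)
```

Spans that were never seen as core words receive probability 0 and drop out, so only triples supported by the learned tables compete.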

(IV) semi-supervised conditional random domain model

Let the data set be R＝{(q_i,r_i,s_i)|i＝1,…,N}, where (q_i,r_i,s_i) is the triple corresponding to a query p and N is the size of the data set. The formalized expression of the core word recognition problem in the query is as follows:

If each core word belonged to a single category, the optimization target could be constructed according to the above formula. In practical application, however, core words are often ambiguous and their number is huge, so the constructed data set is R＝{(q_i,r_i)}, and the category information s_i corresponding to each core word is treated as a hidden variable. The optimization goal of the problem becomes the following:
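Under the factorization Qr(q, r, s) ＝ Qr(q)Qr(s | q)Qr(r | s) given above, one consistent way to write the two objectives is shown below. This is a reconstruction under that assumption rather than the formulas as filed: the first line is the fully marked case, and the second sums the hidden category out of each training pair.

```latex
\max \prod_{i=1}^{N} \mathrm{Qr}(q_i, r_i, s_i)
\qquad\text{and}\qquad
\max \prod_{i=1}^{N} \mathrm{Qr}(q_i, r_i)
  \;=\; \max \prod_{i=1}^{N} \mathrm{Qr}(q_i) \sum_{s} \mathrm{Qr}(s \mid q_i)\, \mathrm{Qr}(r_i \mid s)
```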

(V) inquiry core word recognition method flow

The invention adopts SS-LDA and training data set to construct a query core word recognition system, which comprises three modules: the system comprises a data preprocessing module, an offline training module and an online marking module, and a flow chart of the method is shown in FIG. 3.

Data preprocessing: data samples are selected from a real user query log as training data. These contain a large amount of noise (such as misspellings, irregular spelling, mixed case and the like), which greatly affects the training precision of the whole model. Data preprocessing normalizes and standardizes the search queries input by users. Normalization filters the query: uppercase is converted to lowercase, garbled characters, redundant spaces and tab characters are removed, and stop words are removed, which makes further processing of the query words easier. Standardization performs stemming, restoring each query word to its initial form.
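A minimal sketch of such a preprocessing step is given below. The stop-word list is hypothetical, and the plural-stripping rule is a toy stand-in for a real stemmer; neither is taken from the patent.

```python
import re

STOP_WORDS = {"the", "a", "an", "for", "of"}  # hypothetical stop-word list


def stem(word):
    """Toy rule standing in for a real stemmer: strip a plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(query):
    """Normalise (case, symbols, whitespace, stop words), then stem."""
    text = query.lower()                   # uppercase to lowercase
    text = re.sub(r"[^\w\s]", " ", text)   # drop punctuation and stray symbols
    tokens = text.split()                  # collapses extra spaces and tabs
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens]


cleaned = preprocess("  The RED\tRunning  Shoes!! ")
print(cleaned)
```

Because `\w` is Unicode-aware in Python 3, the same filter keeps Chinese characters while removing punctuation.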

Off-line training: a process of data mining and of solving parameters by a parameter learning method. Because the scale of the data samples is huge and data marking is extremely difficult, core words are selected from the training data set as seeds and marked with their category information (a single core word can correspond to several categories); the data set is then scanned with the seed core words to obtain the training data set (q_i,r_i), and a topic model is trained with SS-LDA. Compared with the traditional conditional random domain, the SS-LDA topic model of the invention has obvious differences and advantages: firstly, the topics (core word categories) are preset; secondly, the topic (category) of each text (core word) is obtained by weakly supervised learning. Through this step, Qr(s | q) is estimated for each seed core word and Qr(r | s) is obtained for each category. The data set is then scanned again to obtain all queries containing a context r (updated in the above step), and the part of each query that remains after the context r is removed is taken as a new core word (a critical value is set in the implementation to ensure precision). For the newly extracted core words, Qr(s | q) is updated by SS-LDA again. In this step the probability Qr(q) of each newly extracted core word q is also updated: Qr(q) is estimated from the frequency with which q appears in the data set, so the more frequently a core word appears, the higher Qr(q). Through the above steps the Qr(q), Qr(s | q) and Qr(r | s) required by the model are solved, and the probabilities obtained offline are stored so that online prediction can be performed efficiently.
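The seed-scanning step and the frequency estimate of Qr(q) can be sketched as follows. The queries, seeds and category names are invented, the `#` placeholder for the core word inside a context is an assumed convention, and the SS-LDA training itself is not shown.

```python
from collections import Counter


def scan_with_seeds(queries, seeds):
    """Return (q_i, r_i) pairs: a seed core word and its '#'-marked context."""
    pairs = []
    for query in queries:
        words = query.split()
        for seed in seeds:
            seed_words = seed.split()
            n = len(seed_words)
            for i in range(len(words) - n + 1):
                if words[i:i + n] == seed_words:
                    context = " ".join(words[:i] + ["#"] + words[i + n:])
                    pairs.append((seed, context))
                    break  # one match per seed and query is enough
    return pairs


def estimate_core_word_prob(pairs):
    """Estimate Qr(q) by the relative frequency of each core word."""
    counts = Counter(q for q, _ in pairs)
    total = sum(counts.values())
    return {q: n / total for q, n in counts.items()}


# Hypothetical query log and seed core words with their marked categories.
queries = ["cheap running shoes", "running shoes for men", "red dress",
           "cheap red dress", "running shoes"]
seeds = {"running shoes": ["Sports"], "dress": ["Fashion"]}

pairs = scan_with_seeds(queries, seeds)
q_prob = estimate_core_word_prob(pairs)
print(pairs, q_prob)
```

A query consisting only of the core word yields the context `#`, which corresponds to the case where the context information is empty.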

The problem of core word recognition in search query is the very important content of vertical search engines such as E-commerce and the like, and is directly related to the search experience and the search conversion rate of a user, and a semi-supervised topic model is ingeniously adopted to model the problem: the core words are used as texts, the context information of the core words is used as words forming the texts, the categories of the core words are used as subjects, and a semi-supervised condition random domain model is adopted to carry out mining and classification on the core words.

Third, structured query information extraction based on semi-supervised conditional random domain

The invention further discusses the problem of marking the semantic role of search, extracts the structured query information from the search query input by the user and marks the semantic role of the structured query information. The background of the vertical website is some semi-structured information, when a user inputs a query, the structured query information is extracted, and the structured information is matched with the background information, so that the search experience of the user is improved. The marking problem of the prior art is that a conditional random field or a similar sequence marking model is trained, but the manual marking of data is carried out by a large amount of manpower, and meanwhile, the problems of inconsistent data marking and the like can be caused. The invention adopts a semi-supervised condition random domain model to solve the problem, and mainly uses two types of data sets: one is a small number of manually tagged queries, and the other is a large number of semi-automatically tagged queries, i.e., automatically tagging certain units in a query through other additional resources. The main contributions of the invention are: the method comprises the steps of firstly, providing a semi-automatic query marking method based on a user click log and a domain knowledge base, and secondly, marking and searching semantic roles based on a semi-supervised conditional random domain.

(I) Formalized representation of the problem

The input of the labeling problem is a known observation sequence, the output is a hidden labeling sequence or a state sequence, the labeling problem learns a model from a training sample so that the model can give the correct labeling sequence to a new observation sequence, the labeling problem is divided into two processes of learning and labeling, and a training data set is given firstly:

R＝{(x₁,y₁),(x₂,y₂),…,(x_n,y_n)}

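The learning process finds a conditional probability model Q(Y | X) from the training data; the labeling process then finds, for a new observation sequence x, the tag sequence with the largest conditional probability, y^*＝argmax_yQ(y|x).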

(II) conditional random domain sequence tagging model

A conditional random field is the Markov random field of a random variable Y given a random variable X. A linear-chain conditional random field computes the conditional probability model Q(Y | X) of the mark sequence given the observation sequence, where Y is the output variable and represents the mark sequence, and X is the input variable and represents the observation sequence to be marked. The learning process uses the training data set to obtain the conditional probability model Q^*(Y | X) by maximum likelihood estimation or regularized maximum likelihood estimation; the prediction process uses the learned model to find, for a given observation sequence x, the state sequence y^* that maximizes the conditional probability Q^*(y | x).

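Q(Y_u|X，Y_k，k≠u)＝Q(Y_u|X，Y_k，k～u)

F＝(U＝{1,2,…,n}，B＝{(i,i+1)})

where k～u denotes the nodes k adjacent to node u in the graph F, U is the node set and B is the edge set of the linear chain, with i＝1,2,…,n−1.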

(III) establishing a sequence mark model

The invention segments semantic units for search query input by a user, and assigns each semantic unit to a preset category, and because of huge data scale, the pure text matching performance is not enough to support the normal operation of the system, and simultaneously, ambiguous word processing is also the problem that simple text matching cannot be solved. The problem of marking the semantic role for searching is a typical problem of sequence marking, the invention adopts a method of a sequence marking model to solve the problem of marking the semantic role for searching, and the architecture of the whole semantic role for marking and searching is shown in figure 4.

The input of the tag search semantic role framework includes two types of data: a small amount of manually marked data and a large amount of semi-automatically marked data. The semantic marker is obtained by training on these two types of resources. The n marked training data are expressed as (x⁽ⁱ⁾,y⁽ⁱ⁾), i＝1,2,…,n, where x⁽ⁱ⁾ denotes an observation sequence and y⁽ⁱ⁾ the corresponding marker sequence. q(·) is a probability function, and the goal of model training is to find the optimal parameter vector h^* that satisfies:
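For fully marked data and training by maximum likelihood estimation, a form consistent with the prediction rule y^*＝argmax_y q(y|x；h) would be the following. It is offered as an assumption, since the formula is not reproduced in this text, and it does not yet account for the hidden variables of the semi-automatically marked data.

```latex
h^{*} \;=\; \arg\max_{h} \sum_{i=1}^{n} \log q\!\left(y^{(i)} \mid x^{(i)};\, h\right)
```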

after the model training is finished, a semantic marker is obtained, and for a given input sequence x, a corresponding output sequence y is obtained^*(semantic class tag sequence):

y^*＝argmax_yq(y|x；h)
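For a linear-chain model, the maximization over y is commonly carried out with the Viterbi algorithm. The sketch below assumes the model's score decomposes into per-token emission log-scores and label-to-label transition log-scores; the lexicon and score values are toy assumptions, not learned parameters, and the token list is assumed to be non-empty.

```python
def viterbi(tokens, labels, emit, trans):
    """Return the highest-scoring label sequence under additive log-scores."""
    # best[t][y]: best score of any label sequence ending in y at position t
    best = [{y: emit(tokens[0], y) for y in labels}]
    back = []
    for token in tokens[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(token, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


LABELS = ["Make", "Colour", "Style", "Product"]
LEXICON = {"red": "Colour", "nike": "Make",
           "running": "Style", "shoes": "Product"}  # toy lexicon


def emit(token, label):
    """Toy emission score: reward the lexicon label, penalise the others."""
    return 0.0 if LEXICON.get(token) == label else -2.0


def trans(prev, label):
    """Toy transition score: mildly discourage repeating a label."""
    return -0.5 if prev == label else 0.0


tags = viterbi(["red", "nike", "running", "shoes"], LABELS, emit, trans)
print(tags)
```

Dynamic programming keeps the cost linear in the query length, which matters because enumerating all label sequences grows exponentially with the number of tokens.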

FIG. 4. Tag search semantic role framework (elements: training samples, input, semantic tagger (model), output).

(IV) Semi-automatic marking method

The structured query information extraction based on the semi-supervised conditional random domain has two main advantages and contributions: firstly, a semi-supervised conditional random domain is adopted to solve the semantic role marking problem, fusing a small amount of manually marked data with a large amount of semi-automatically marked data and learning the conditional random domain model in a semi-supervised manner; secondly, a feasible method is provided for semi-automatic marking of queries. It is assumed that the user search log is available, that is, two-tuples (query, item title) formed from a user query and the commodity list returned by the search engine, and the queries are marked semi-automatically on the basis of the semi-automatic marking method and the semi-supervised conditional random domain model.

The semi-automatic marking method makes full use of users' search click logs to pre-mark the semantic units of a query. The invention takes e-commerce queries as an embodiment, and the marking range covers four categories: Make, Colour, Style and Product. The method associates the semantic units in a query with the related commodity information in the commodity click logs. In an e-commerce vertical search engine, commodity information is stored in a database in structured or semi-structured form, so an association can be established between the semantic units in a query and the commodity information through the users' click logs, and certain semantic units in the query are marked in advance by a string matching algorithm. The semi-automatic marking method comprises the following steps and modules:

First, click data extraction. Click data is extracted from the user's search log: when a user searches for products with a search engine, the input query and the series of commodity clicks are recorded in the search log as two-tuples (query, product), which establishes the relationship between the user's input query and the commodities.

Secondly, the commodity information base. In an e-commerce vertical search engine, commodity information is stored in the database as structured data, and each commodity contains the title, attribute and detail information filled in by the merchant. The second stage of semi-automatic marking establishes a direct relationship between the user's input query and the structured commodity information. Because users' click behaviors are relatively few, data sparseness is alleviated by calculating the similarity between every commodity in the commodity information base and the commodities clicked by the user and adding the commodities with higher similarity to the clicked commodity set. The similarity measure is the TF-IDF based cosine similarity with a threshold of 0.75: every commodity whose similarity exceeds 0.75 is added to the user's clicked commodity set. The structured commodity information is then matched against the search query input by the user. To improve marking coverage, fuzzy matching is used in the matching process, and a mapping (query, Metadata) between the user's input query and the structured commodity fields is established.
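The click-set expansion in this step can be sketched as follows. This is a minimal illustration under stated assumptions: titles are tokenized by whitespace, the IDF uses a smoothed form log((1 + N) / (1 + df)) + 1, and the function names and example titles are hypothetical. The source specifies only TF-IDF based cosine similarity and the 0.75 threshold.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors; the smoothed IDF form is an assumption."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_click_set(titles: List[str], clicked: List[int],
                     threshold: float = 0.75) -> List[int]:
    """Return clicked indices plus every item whose similarity to any
    clicked item exceeds `threshold`."""
    vecs = tfidf_vectors([t.lower().split() for t in titles])
    expanded = set(clicked)
    for i, vec in enumerate(vecs):
        if i not in expanded and any(cosine(vec, vecs[c]) > threshold for c in clicked):
            expanded.add(i)
    return sorted(expanded)


titles = ["red running shoes acme",
          "acme red running shoes men",
          "blue denim jacket"]
expanded = expand_click_set(titles, clicked=[0])
print(expanded)  # [0, 1]
```

In this toy catalogue the second title shares four of its five terms with the clicked title and clears the threshold, while the unrelated jacket does not.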

Thirdly, data mapping. The preceding steps yield the mapping between the user's input query and the structured fields of the commodity; data mapping then maps those structured fields onto the four categories Make, Colour, Style and Product.

Fourth, automatic marking. Given a two-tuple (query, Metadata), the initial query input by the user is semi-automatically marked with the following rules: first, if a unit in the query does not appear in any field of Metadata, or appears in multiple fields of Metadata, the unit is marked NULL; second, if a unit in the query appears in exactly one field of Metadata, the unit is marked with the category corresponding to that field. These steps complete the automatic marking of the query.
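The two marking rules can be written down directly. The sketch below is illustrative only: it assumes the query is already segmented into units, represents Metadata as a dictionary from category to a set of values, and uses exact membership in place of the fuzzy matching mentioned earlier. The function name and example data are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def auto_tag(query_units: List[str],
             metadata: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Mark each unit with the single field that contains it, otherwise NULL.

    A unit found in no field, or in more than one field, is marked NULL.
    """
    tagged = []
    for unit in query_units:
        fields = [name for name, values in metadata.items() if unit in values]
        tagged.append((unit, fields[0] if len(fields) == 1 else "NULL"))
    return tagged


metadata = {
    "Make": {"acme"},
    "Colour": {"red", "orange"},
    "Style": {"running"},
    "Product": {"shoes", "orange"},  # "orange" is deliberately ambiguous
}
tagged = auto_tag(["acme", "orange", "running", "shoes", "cheap"], metadata)
print(tagged)
# [('acme', 'Make'), ('orange', 'NULL'), ('running', 'Style'),
#  ('shoes', 'Product'), ('cheap', 'NULL')]
```

The NULL positions are the ones left for the semi-supervised model to resolve.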

(V) Structured query information extraction model

Semi-supervised and unsupervised learning methods are being developed in the field of natural language processing. The semi-supervised method commonly used in the prior art is self-learning: first, a seed model is trained on a small manually marked data set; the model then predicts the unmarked data; predictions with high confidence are added to the manually marked data set to expand the training samples; and these steps are repeated until a satisfactory effect is achieved. Self-learning can alleviate the shortage of marked data to some extent, but it is a summary of practical experience that lacks a theoretical basis and does not achieve good results on many problems. Its underlying weakness is that the state sequences of the unmarked data are unknown, so a small amount of marked data cannot cover all the patterns of the unmarked data.

The invention uses a semi-automatic marking method to pre-mark the units in a query and trains a conditional random field model with more latent information: a relational data table is used to semi-automatically mark the data set, and users' click log information is used to complete the semi-automatic marking of the training data. Because a large amount of manually marked data is lacking, a conditional random field model based on supervised learning alone cannot solve the problem of marking search semantic roles well.

The invention uses the following definitions:

Second, the semi-automatically marked data set: the training data set marked by the semi-automatic marking method using additional resources is a set of marked units in the query, formalized as z = (z_1, z_2, …, z_R).

The semi-automatically marked data set supplements the manually marked data set and solves the problem that the manually marked data set cannot cover all patterns of the unmarked data. The invention mainly uses two types of data sets to learn the conditional random field model: first, a small manually marked data set, and second, a large semi-automatically marked data set. Only some semantic units of the semi-automatically marked data set are marked, and the following assumption is made: if y_r = z_r, the variable y_r is treated as an observed variable; otherwise y_r is treated as a hidden variable.
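To make the observed/hidden distinction concrete, the sketch below enumerates every full tag sequence that agrees with a partially marked query. Positions carrying a semi-automatic mark are held fixed (observed), and unmarked positions range over all tags (hidden). Using `None` for an unmarked position and the function name are assumptions for illustration. Summing model probability over these completions is one common way to train on partially marked sequences; the source does not spell out the training objective.

```python
from itertools import product
from typing import List, Optional


def compatible_sequences(partial: List[Optional[str]],
                         tags: List[str]) -> List[List[str]]:
    """Enumerate full tag sequences that agree with every observed position.

    `partial[r]` is the semi-automatic mark z_r, or None when position r
    is unmarked and therefore treated as a hidden variable.
    """
    choices = [[z] if z is not None else list(tags) for z in partial]
    return [list(seq) for seq in product(*choices)]


TAGS = ["Make", "Colour", "Style", "Product"]
completions = compatible_sequences(["Make", None, "Product"], TAGS)
print(len(completions))  # 4: one completion per candidate tag of the hidden unit
```

Explicit enumeration is only practical for short queries; a real implementation would typically compute the same sum with a constrained forward pass.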

(VI) Structured query information extraction method flow

The algorithm flow chart for extracting structured query information from a query is shown in fig. 5. The initial data is first marked semi-automatically by the semi-automatic marking module of the invention. A model is then trained on the semi-automatically marked data together with the manually marked data set, the unmarked data is marked by the semi-supervised conditional random field model of the invention, and the natural language text input by the user is expressed in the form of structured data.

Marking search semantic roles involves two sub-problems: core word recognition in the query and extraction of structured query information from the query. The effectiveness of the invention is verified through experiments: first, comparison with a sequence marking model demonstrates the effect of the three-layer Bayes semi-supervised probability model in query core word recognition; second, comparison with prior-art sequence marking models and classification models demonstrates the superiority of the semi-supervised conditional random field model in extracting structured query information from queries.

The invention defines the concept of marking search semantic roles. Based on machine learning topic models and sequence marking models, it analyzes the application scenarios of the topic model and the sequence marking model at the query structure analysis level, and applies a three-layer Bayes semi-supervised probability model together with a semi-supervised conditional random field model to marking search semantic roles. It mainly comprises the following aspects: first, at the basic theory level, the topic model and the sequence marking model are analyzed in depth, and their derivation, training methods and application scenarios are studied systematically, which fills gaps in theoretical knowledge and guides the practice of structured query information expression more effectively; second, the concept of marking search semantic roles is put forward by analogy with semantic role marking, and a complete definition is given: a search query input by a user is represented in a structured data format governed by core words, and the core words and the other semantic arguments they govern are marked in the query; third, the search query input by the user is modeled mathematically and formalized as a triple, and the core word in the query is then identified with a semi-supervised topic model; fourth, the internal language structure of the query is analyzed, a semi-supervised conditional random field model is used to extract structured query information, the natural language text input by the user is expressed as structured data, and good results are obtained on a real e-commerce search engine data set; fifth, comparison tests are carried out on the real data set for different models and different feature spaces, and the experimental results demonstrate the superiority of the method in marking search semantic roles.

Claims

1. The structural query information expression method for marking the semantic role of search is characterized in that the structural query information is extracted from the search query input by a user, a natural language text is expressed into structural data, the search intention of the user is accurately analyzed, and the search satisfaction of the user is improved; based on the potential semantic structure of the query and the extraction of the search structural query information, formalized representation is carried out, the concept of marking the search semantic role is provided and the complete definition is given: representing a search query input by a user into a structured data format dominated by core words, and marking the core words and semantic argument dominated by the core words in the search query;

2. The method for expressing the structural query information of the labeled search semantic role according to claim 1, wherein the semantic role label is a method for labeling a predicate in a sentence and other components governed by the predicate, and deeply parsing the structure of the sentence to perform semantic analysis, the semantic role label identifies a predicate in a sentence and other semantic arguments governed by the predicate, the labeled search semantic role automatically labels each semantic role in the search query, analyzes the structure of the query to deeply parse the search intention of a user, the query sentence is governed by a core word, and the other components in the query are subordinate to the core word;

p→{ProWord；SeUnit₁,SeUnit₂,…,SeUnit_n}

3. The method for expressing the structural query information of the mark search semantic role according to claim 1, characterized in that a topic model is introduced: when judging text relevance, not only the co-occurrence of words but also the deep semantics expressed by the text are considered, and a topic model is introduced for semantic analysis; a topic in the topic model is expressed as a group of generalized expressions sharing the same concept, and the generation of a text is explained with a generative model: a text contains a plurality of topics, each topic selects words according to probability, and the generation process of the text is represented as follows:

q(word | text) = Σ_topic q(word | topic) × q(topic | text)
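A worked instance of this mixture may help. The sketch below sums q(word | topic) × q(topic | text) over the topics of one text. The topic names and probabilities are made-up illustrative numbers, and the function name is hypothetical.

```python
from typing import Dict


def word_given_text(word: str,
                    topic_given_text: Dict[str, float],
                    word_given_topic: Dict[str, Dict[str, float]]) -> float:
    """q(word | text) as a sum over topics of q(word | topic) * q(topic | text)."""
    return sum(p_topic * word_given_topic[topic].get(word, 0.0)
               for topic, p_topic in topic_given_text.items())


# Assumed toy distributions for a single text.
topic_given_text = {"footwear": 0.7, "apparel": 0.3}
word_given_topic = {
    "footwear": {"shoes": 0.5, "laces": 0.1},
    "apparel": {"shoes": 0.1, "jacket": 0.4},
}
p_shoes = word_given_text("shoes", topic_given_text, word_given_topic)
# 0.7 * 0.5 + 0.3 * 0.1 = 0.38
```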

4. The method for expressing the structural query information of the mark search semantic role according to claim 1, characterized in that a topic model is derived as follows: a query containing a single core word is formally represented as a triple (q, r, s), where q represents the core word in the query, r represents the context information of the core word in the query, and s represents the category information of the core word; the goal of core word recognition is to identify the core word q in the query p and assign q to the most likely category s, and the problem is converted into finding, among all possible triples, the triple (q, r, s)^* with the highest probability:

(q, r, s)^* = argmax_(q,r,s) Qr(p, q, r, s)

= argmax_(q,r,s) Qr(p | q, r, s) Qr(q, r, s)

= argmax_{(q,r,s)∈F(p)} Qr(q, r, s)

Qr(q, r, s) = Qr(q) Qr(s | q) Qr(r | q, s)

= Qr(q) Qr(s | q) Qr(r | s)

in the equations, it is assumed that Qr(r | q, s) = Qr(r | s), so the problem of core word recognition in queries reduces to estimating Qr(q), Qr(s | q) and Qr(r | s), which involve a large number of core words and a large amount of context information.
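The factorized search can be sketched as below. This is an illustration under assumptions: F(p) is taken to be every contiguous token span as the core word with the rest of the query as context, a "#" placeholder marks the core word slot in the context, and the three probability tables hold made-up values. None of these specifics come from the source.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def candidate_triples(tokens: List[str], categories: List[str]) -> List[Triple]:
    """Enumerate an assumed F(p): each contiguous span is a candidate core
    word q, the remaining tokens form the context r, for every category s."""
    triples: List[Triple] = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            q = " ".join(tokens[i:j])
            r = " ".join(tokens[:i] + ["#"] + tokens[j:])  # "#" = core word slot
            triples.extend((q, r, s) for s in categories)
    return triples


def best_triple(candidates: List[Triple],
                p_q: Dict[str, float],
                p_s_given_q: Dict[Tuple[str, str], float],
                p_r_given_s: Dict[Tuple[str, str], float]) -> Triple:
    """Pick the triple maximizing Qr(q) * Qr(s | q) * Qr(r | s)."""
    def score(triple: Triple) -> float:
        q, r, s = triple
        return (p_q.get(q, 0.0)
                * p_s_given_q.get((s, q), 0.0)
                * p_r_given_s.get((r, s), 0.0))
    return max(candidates, key=score)


# Assumed toy probabilities.
p_q = {"shoes": 0.6, "red": 0.1, "red shoes": 0.05}
p_s_given_q = {("Footwear", "shoes"): 0.9, ("Colour", "red"): 0.8}
p_r_given_s = {("red #", "Footwear"): 0.3, ("# shoes", "Colour"): 0.05}

candidates = candidate_triples(["red", "shoes"], ["Footwear", "Colour"])
best = best_triple(candidates, p_q, p_s_given_q, p_r_given_s)
print(best)  # ('shoes', 'red #', 'Footwear')
```

Here the winning triple scores 0.6 × 0.9 × 0.3 = 0.162, against 0.004 for treating "red" as the core word.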

5. The method for expressing the structural query information of the labeled search semantic role according to claim 4, characterized by a semi-supervised conditional random field model: let the data set be R = {(q_i, r_i, s_i) | i = 1, …, N}, where (q_i, r_i, s_i) is the triple corresponding to a query p and N is the size of the data set; the core word recognition problem in queries is formally expressed as:

solving the problem thus becomes the probability estimation problem of the above formula, which takes the form of a topic model: the core word corresponds to the text, the context information of the core word corresponds to the words of the text, and the category information corresponds to the topic; a conditional random field topic model is adopted and learned in a semi-supervised manner, namely SS-LDA, in which the topics, namely the categories, are agreed in advance and the topic of each text, namely each core word, is marked in the training data set.

6. The method for expressing the structural query information of the labeled search semantic role according to claim 5, wherein the query core word recognition method comprises the following procedures: the method adopts SS-LDA and a training data set to construct a query core word recognition system, and comprises three modules: the system comprises a data preprocessing module, an offline training module and an online marking module;

off-line training: a process of solving the parameters by data mining and parameter learning; first, core words are selected from the training data set as seeds and marked with their corresponding category information; the data set is then scanned with the seed core words to obtain the training data set (q_i, r_i); a topic model is trained by SS-LDA, Qr(s | q) is estimated for each seed core word, and Qr(r | s) is obtained for each category; the data set is scanned again to obtain all queries containing the context information r, and the part of each query outside the context information r is taken as a new core word; for the newly extracted core words, Qr(s | q) is updated by SS-LDA again, and the probability Qr(q) of each newly extracted core word q is updated in this step; Qr(q) is estimated from the frequency with which the core word q appears in the data set, that is, the higher the frequency of the core word q, the higher the probability Qr(q); through these steps, the Qr(q) and Qr(r | s) required by the model are solved, and the probabilities obtained off-line are stored so that prediction can be performed efficiently on-line;
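The frequency-based estimate of Qr(q) admits a very small sketch. Normalizing counts to relative frequencies is an assumption made here for illustration; the text states only that a more frequent core word receives a higher probability. The function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def estimate_core_word_prior(core_words: List[str]) -> Dict[str, float]:
    """Estimate Qr(q) as the relative frequency of each core word q."""
    counts = Counter(core_words)
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}


# Core words extracted from four logged queries (assumed data).
prior = estimate_core_word_prior(["shoes", "shoes", "jacket", "shoes"])
print(prior)  # {'shoes': 0.75, 'jacket': 0.25}
```

A table like this can be computed once off-line and looked up at query time.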

7. The method for expressing the structural query information of the marker search semantic role according to claim 1, wherein the marker question formally expresses: the input of the labeling problem is a known observation sequence, the output is a hidden labeling sequence or a state sequence, the labeling problem learns a model from a training sample so that the model can give the correct labeling sequence to a new observation sequence, the labeling problem is divided into two processes of learning and labeling, and a training data set is given firstly:

R＝{(x₁,y₁),(x₂,y₂),…,(x_n,y_n)}

wherein x_i = (x_i⁽¹⁾, x_i⁽²⁾, …, x_i⁽ⁿ⁾), i = 1, 2, …, n, is the observation sequence, y_i = (y_i⁽¹⁾, y_i⁽²⁾, …, y_i⁽ⁿ⁾) is the corresponding marker sequence or state sequence, and n represents the length of an observation sequence; a learning system learns a model from the training data set, and the whole process is represented by the conditional probability distribution:

wherein each X⁽ⁱ⁾ (i = 1, 2, …, n) ranges over all possible observations and each Y⁽ⁱ⁾ (i = 1, 2, …, n) ranges over all possible labels; for a new input observation sequence, the marking system finds the corresponding state sequence as output according to the learned conditional probability distribution model, specifically: for an observation sequence:

finding a conditional probability:

the largest tag sequence:

marking search semantic roles is a typical marking problem and is solved with a sequence marking model, specifically a semi-supervised conditional random field model.

8. The method for expressing the structural query information of the marker search semantic role according to claim 1, characterized by a conditional random field sequence marking model: a conditional random field is a Markov random field of the random variable Y given the random variable X; a linear-chain conditional random field computes, for a given observation sequence, the conditional probability model Q(Y | X) of the marker sequence, where Y is the output variable representing the marker sequence and X is the input variable representing the observation sequence to be marked; the learning process uses the training data set to obtain the conditional probability model Q^*(Y | X) through maximum likelihood estimation or regularized maximum likelihood estimation; the prediction process computes, for a given observation sequence x, the state sequence y that maximizes the conditional probability Q(y | x) under the learned model;

Q(Y_u | X, Y_k, k ≠ u) = Q(Y_u | X, Y_k, k ~ u)

F = (U = {1, 2, …, n}, B = {(i, i+1)})

9. The method for expressing the structural query information of the marker search semantic role according to claim 1, characterized in that a sequence marker model is established: segmenting semantic units of search queries input by users, attributing each semantic unit to a preset category, and solving the problem of marking search semantic roles by adopting a sequence marking model;

the input of the framework for marking search semantic roles includes two types of data: first, a small amount of manually marked data, and second, a large amount of semi-automatically marked data; the semantic marker is obtained by training on these two types of resources; the n marked training data are expressed as (x⁽ⁱ⁾, y⁽ⁱ⁾), i = 1, 2, …, n, where x⁽ⁱ⁾ denotes the observation sequence, y⁽ⁱ⁾ denotes the marker sequence, and q(·) is a probability function; the goal of model training is to find the optimal parameter vector h such that it satisfies:

after model training is completed, a semantic marker is obtained, which for a given input sequence x yields the corresponding output sequence y^*:

y^* = argmax_y q(y | x; h)

10. The method for expressing the structural query information of the labeled search semantic role according to claim 1, wherein the structural query information extraction model comprises: pre-marking the units in a query by a semi-automatic marking method, training a conditional random field model using more latent information, semi-automatically marking the data set with a relational data table, and completing the semi-automatic marking of the training data with users' click log information;

as defined below:

The semi-automatically marked data set supplements the manually marked data set and solves the problem that the manually marked data set cannot cover all patterns of the unmarked data. Two types of data sets are mainly used to learn the conditional random field model: first, a small manually marked data set, and second, a large semi-automatically marked data set. Only some semantic units of the semi-automatically marked data set are marked, and the following assumption is made: if y_r = z_r, then y_r is treated as an observed variable; otherwise y_r is treated as a hidden variable.