CN112818091A - Object query method, device, medium and equipment based on keyword extraction - Google Patents


Info

Publication number
CN112818091A
CN112818091A (application CN201911120133.1A)
Authority
CN
China
Prior art keywords
vertex
weight
graph
candidate keywords
word
Prior art date
Legal status
Pending
Application number
CN201911120133.1A
Other languages
Chinese (zh)
Inventor
王娜
肖宁
高云
胡忆桐
左丽丽
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201911120133.1A
Publication of CN112818091A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation

Abstract

The disclosure provides an object query method based on keyword extraction, an object query apparatus based on keyword extraction, a computer-readable storage medium, and an electronic device, and belongs to the technical field of natural language processing. The method comprises the following steps: performing word segmentation on a subject text of an object to be queried to obtain a plurality of candidate keywords; establishing a graph model with one or more candidate keywords as vertices according to the semantic similarity among the candidate keywords; determining the weights of the vertices in the graph model by an iterative algorithm over vertex weights; determining, according to the vertex weights, target keywords related to the object to be queried from the candidate keywords corresponding to the vertices; and, when a query request containing a target keyword is received, adding the object to be queried to the query result of that request. By exploiting semantic relations, the method and apparatus improve the accuracy of keyword extraction from the object text, and thereby the accuracy of object query.

Description

Object query method, device, medium and equipment based on keyword extraction
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to an object query method based on keyword extraction, an object query apparatus based on keyword extraction, a computer-readable storage medium, and an electronic device.
Background
With the development of computer technology, object query based on keyword extraction has been widely applied in many fields. Such methods mainly extract keywords from an object's text and, when a user issues a query, match those keywords against the words the user inputs, thereby determining the object the user is looking for.
At present, keywords of an object are mainly extracted by computing the frequency and positional relationship of words in the object's text, for example taking a word that occurs frequently and appears at the end of a phrase (the modified word) as a keyword. In practice, however, word frequency is highly dependent on the content and length of the text: in a short text, frequencies are generally low across the board, so keywords cannot be distinguished from non-keywords. The positional relationship likewise varies with the writer's style; in syntactic structures such as inversion and subordinate clauses, or in scenes with relaxed grammar such as forums and online chat, the relative position of words is not fixed. As a result, the accuracy of object query based on such keyword extraction is not high.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides an object query method based on keyword extraction, an object query apparatus based on keyword extraction, a computer-readable storage medium, and an electronic device, so as to alleviate, at least to some extent, the problem of low object-query accuracy in the prior art.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the present disclosure, there is provided an object query method based on keyword extraction, the method including: performing word segmentation on a subject text of an object to be queried to obtain a plurality of candidate keywords; according to the semantic similarity among the candidate keywords, establishing a graph model by taking one or more candidate keywords as vertexes; determining the weight of the vertex in the graph model based on an iterative algorithm of vertex weight; determining a target keyword related to the object to be queried from the candidate keywords corresponding to the vertex according to the weight of the vertex; and when a query request containing the target keyword is received, adding the object to be queried to a query result of the query request.
In an exemplary embodiment of the present disclosure, the semantic similarity between the candidate keywords is obtained by: obtaining semantic vectors of the candidate keywords by using a word vector model; and calculating the similarity of any two semantic vectors to serve as the semantic similarity between the candidate keywords corresponding to the two semantic vectors.
In an exemplary embodiment of the present disclosure, the word vector model is obtained by: acquiring a general corpus and a target scene corpus; respectively segmenting the texts in the general corpus set and the texts in the target scene corpus set to obtain a general word bank and a target scene word bank; and training according to the general word stock and the target scene word stock to obtain the word vector model.
In an exemplary embodiment of the present disclosure, the graph model comprises an undirected graph; the establishing of the graph model by taking one or more candidate keywords as vertexes according to the semantic similarity among the candidate keywords comprises the following steps: taking each candidate keyword as a vertex; for each candidate keyword, arranging word pairs formed by the candidate keyword and other candidate keywords respectively according to the sequence of the semantic similarity from high to low, selecting the first N word pairs, and establishing an edge between two vertexes corresponding to each word pair respectively to construct the undirected graph; wherein N is a preset positive integer.
In an exemplary embodiment of the present disclosure, after constructing the undirected graph, the method further comprises: judging whether the undirected graph is a connected graph or not; if the undirected graph is a connected graph, executing the iterative algorithm based on the vertex weight, and determining the weight of the vertex in the undirected graph; if the undirected graph is not a connected graph, increasing N, and reestablishing the undirected graph until the undirected graph is the connected graph.
In an exemplary embodiment of the disclosure, the vertex weight-based iterative algorithm, determining the weight of the vertex in the graph model includes: acquiring initial weight of each vertex in the graph model; and adopting a text sorting algorithm, taking semantic similarity between candidate keywords corresponding to any two vertexes as the edge weight between the two vertexes, iteratively updating the weight of each vertex in the graph model until convergence, and determining the weight of each vertex.
In an exemplary embodiment of the present disclosure, the determining, according to the weight of the vertex, a target keyword related to the object to be queried from the candidate keywords corresponding to the vertex includes: ranking the vertices in descending order of weight, and determining the candidate keywords corresponding to the first K vertices as the target keywords related to the object to be queried; wherein K is a preset positive integer.
In an exemplary embodiment of the present disclosure, the object to be queried includes a commodity; the subject text includes any one or more of: title of the goods, name of the goods, brief introduction of the goods.
According to a second aspect of the present disclosure, there is provided an object query apparatus based on keyword extraction, the apparatus including: the processing module is used for carrying out word segmentation processing on the subject text of the object to be queried to obtain a plurality of candidate keywords; the establishing module is used for establishing a graph model by taking one or more candidate keywords as vertexes according to the semantic similarity among the candidate keywords; the calculation module is used for determining the weight of the vertex in the graph model based on an iterative algorithm of vertex weight; the determining module is used for determining a target keyword related to the object to be queried from the candidate keywords corresponding to the vertex according to the weight of the vertex; and the adding module is used for adding the object to be queried to a query result of the query request when the query request containing the target keyword is received.
In an exemplary embodiment of the present disclosure, the establishing module includes: the semantic vector unit is used for obtaining a semantic vector of the candidate keyword by utilizing a word vector model; and the calculating unit is used for calculating the similarity of any two semantic vectors to serve as the semantic similarity between the candidate keywords corresponding to the two semantic vectors.
In an exemplary embodiment of the present disclosure, the semantic vector unit may further include: the obtaining subunit is used for obtaining the general corpus and the target scene corpus; the word segmentation subunit is used for respectively segmenting the texts in the general corpus set and the texts in the target scene corpus set to obtain a general word bank and a target scene word bank; and the model subunit is used for training according to the general word bank and the target scene word bank and obtaining the word vector model.
In an exemplary embodiment of the present disclosure, the graph model comprises an undirected graph; the establishing module is used for taking each candidate keyword as a vertex; for each candidate keyword, arranging word pairs formed by the candidate keyword and other candidate keywords respectively according to the sequence of the semantic similarity from high to low, selecting the first N word pairs, and establishing an edge between two vertexes corresponding to each word pair respectively to construct the undirected graph; wherein N is a preset positive integer.
In an exemplary embodiment of the present disclosure, after the undirected graph is constructed, the computing module further includes a determining unit, configured to determine whether the undirected graph is a connected graph; if the undirected graph is a connected graph, executing the iterative algorithm based on the vertex weight, and determining the weight of the vertex in the undirected graph; if the undirected graph is not a connected graph, increasing N, and reestablishing the undirected graph until the undirected graph is the connected graph.
In an exemplary embodiment of the disclosure, the calculation module is further configured to obtain an initial weight of each vertex in the graph model; and adopting a text sorting algorithm, taking semantic similarity between candidate keywords corresponding to any two vertexes as the edge weight between the two vertexes, iteratively updating the weight of each vertex in the graph model until convergence, and determining the weight of each vertex.
In an exemplary embodiment of the disclosure, the determining module is configured to rank the vertices according to an order from high to low of the weights, and determine the candidate keywords corresponding to the first K vertices as target keywords related to the object to be queried; wherein K is a preset positive integer.
In an exemplary embodiment of the present disclosure, the object to be queried includes a commodity; the subject text includes any one or more of: title of the goods, name of the goods, brief introduction of the goods.
According to a third aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements any one of the object querying methods described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any one of the object query methods described above via execution of the executable instructions.
The present disclosure has the following beneficial effects:
The exemplary embodiments of the present disclosure provide an object query method based on keyword extraction, an object query apparatus based on keyword extraction, a computer-readable storage medium, and an electronic device. Candidate keywords are obtained by segmenting the subject text of the object to be queried; a graph model is established according to the semantic similarity between the candidate keywords; the weight of each vertex is then determined by a computation over the vertex weights of the graph model, from which the target keywords are determined, so that information about the object to be queried can be retrieved by querying with the target keywords. Compared with the prior art, in which text keywords are extracted by statistical methods such as word frequency, this approach adds semantic understanding of the words in a text, determines the degree of association between words and hence their importance in the text, improves the accuracy of keyword extraction, and thereby improves the accuracy of querying object information through keywords.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be obtained from those drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a flow chart of an object query method in the exemplary embodiment;
FIG. 2 illustrates a sub-flow diagram of an object query method in the exemplary embodiment;
FIG. 3 illustrates a flow chart of another object query method in the exemplary embodiment;
fig. 4 is a block diagram showing a structure of an object querying device in the present exemplary embodiment;
FIG. 5 illustrates a computer-readable storage medium for implementing the above-described method in the present exemplary embodiment;
fig. 6 shows an electronic device for implementing the above method in the present exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. The exemplary embodiments, however, may be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
An exemplary embodiment of the present disclosure first provides an object query method based on keyword extraction. The method can be applied to obtaining corresponding object information through keyword query, for example, commodities required by a user can be queried through inputting commodity names on an e-commerce platform; searching for topics of interest to the user in social applications, and the like.
Fig. 1 shows a flow of the present exemplary embodiment, which may include the following steps S110 to S150:
and S110, performing word segmentation on the subject text of the object to be queried to obtain a plurality of candidate keywords.
The object to be queried refers to an object whose information can be obtained through a query; the subject text of the object to be queried is the text content from which its keywords are extracted. In this exemplary embodiment, the object to be queried may be a commodity, which may be a physical article or a network resource such as an online course or a video/audio program, and the subject text may be one or more of a commodity title, a commodity name, a commodity introduction, commodity content, and the like, composed of characters, symbols, numbers, and so on. Combinations of characters and symbols form a text language about the object to be queried with different meanings and modes of expression. When text is recognized and processed by a machine, it is analyzed with words as the basic unit; that is, the sentences or paragraphs in the text must be divided into mutually independent, complete, and correct words, which is word segmentation. In English, words are usually separated by spaces; note that some proper nouns or special terms consist of several words, for example "New York". For Chinese text, punctuation marks, special symbols, and designated stop words such as connectives and demonstratives ("and", "of", "bar", etc.) need to be removed from the text. For example, if the subject text of the object to be queried is "XX (brand name) large-suction large-size European QW220 smoke exhaust ventilator", removing the stop word QW220 according to the corresponding stop-word library gives the segmentation result: XX, large suction, large size, European type, smoke exhaust ventilator.
In an optional implementation, when the subject text of the object to be queried is an English text, the word segmentation can extract word combinations with special connection relationships through a preset segmentation library, and use those combinations together with the other words in the text as candidate keywords of the subject text of the object to be queried.
Because Chinese is written with characters as the basic unit, there is no obvious boundary mark between words, and words must be segmented according to corresponding rules. In an optional implementation, the characters or words in a Chinese text can be matched against a pre-configured "dictionary". The "dictionary" may be a professional-field word library built from information such as the usage frequency of characters or words in related fields, or a general word library built from usage across multiple fields. When a character or word is found in the "dictionary", it can serve as a basic unit and thus as a candidate keyword; when it is not found, it is removed.
In addition, because Chinese characters are numerous and combine in many ways, a computer can simulate comprehension of the text, for example by analyzing the grammar and semantics of the text while segmenting it and resolving ambiguities that arise during segmentation, thereby improving segmentation accuracy.
In an alternative embodiment, the above process can be implemented with an existing word segmentation tool, for example jieba (a Chinese natural-language-processing tool), CoreNLP (a toolkit developed by a Stanford University team, supporting Chinese), or LTP (a language analysis tool developed by a Harbin Institute of Technology team). All of these support Chinese and English word segmentation with part-of-speech tagging, and a custom word library, such as a corpus related to the business scenario, can be added to achieve targeted segmentation.
In addition, the word segmentation method may also adopt a statistical word segmentation method, a semantic word segmentation method, etc., and the above methods are only used as exemplary illustrations. It should be understood that the word segmentation processing method described in the present exemplary embodiment should not be taken as a limitation on the scope of the word segmentation method of the present disclosure.
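As a rough illustration of the dictionary-based matching described above, the following Python sketch applies forward maximum matching against a tiny hand-made dictionary and then removes stop words. The dictionary, stop-word list, and sample title are all hypothetical, and a real system would use a mature segmenter such as jieba with a domain word library instead:

```python
# Hedged sketch: forward maximum matching against a tiny hypothetical
# dictionary, then stop-word removal. A real system would use a mature
# segmenter such as jieba with a domain word library instead.

DICTIONARY = {"new", "york", "new york", "range", "hood", "range hood",
              "large", "suction", "european", "style"}
STOP_WORDS = {"qw220", "the", "of"}
MAX_WORD_LEN = 2  # longest dictionary entry, measured in tokens

def segment(tokens):
    """Greedily match the longest dictionary entry at each position."""
    words, i = [], 0
    while i < len(tokens):
        for size in range(min(MAX_WORD_LEN, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + size]).lower()
            if size == 1 or cand in DICTIONARY:
                # Unknown single tokens fall through as-is so the
                # stop-word filter can still drop them (e.g. "qw220").
                words.append(cand)
                i += size
                break
    return [w for w in words if w not in STOP_WORDS]

title = "Large suction European style QW220 range hood"
print(segment(title.split()))  # ['large', 'suction', 'european', 'style', 'range hood']
```

The multi-token entry "range hood" survives as one candidate keyword, while the model number QW220 is dropped by the stop-word filter, mirroring the example in the description.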
And S120, establishing a graph model by taking one or more candidate keywords as vertexes according to the semantic similarity among the candidate keywords.
Semantic similarity measures, from the machine's point of view, how close two words are in meaning. Since it is generally difficult to compute semantic similarity between words directly, the words can first be converted into word vectors, and the similarity between the word vectors taken as the semantic similarity between the words. A graph model is a structure composed of a number of vertices and the edges connecting them; unlike an ordinary geometric figure, each vertex and edge of a graph model can be assigned a value representing distance, flow, cost, and so on. Graph models can therefore describe relationships among things and are an effective tool for analyzing complex systems.
According to the semantic similarity among the candidate keywords, part or all of the candidate keywords are selected as vertices, and a graph model formed by the candidate keywords is established.
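The similarity between two semantic vectors in this step is commonly taken to be their cosine similarity. The following stdlib-only sketch assumes the word vectors have already been produced by some word vector model; the vectors shown are made-up placeholders, not output of a real model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical 4-dimensional semantic vectors for two candidate keywords;
# a real word vector model would produce vectors of hundreds of dimensions.
vec_hood = [0.8, 0.1, 0.3, 0.0]
vec_ventilator = [0.7, 0.2, 0.4, 0.1]
print(round(cosine_similarity(vec_hood, vec_ventilator), 3))  # 0.973
```

A value near 1 indicates near-synonyms, a value near 0 unrelated words, which is exactly the property the edge weights of the graph model rely on.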
In an alternative embodiment, the graph model may be an undirected graph, and thus step S120 may be implemented by:
taking each candidate keyword as a vertex;
and for each candidate keyword, arranging word pairs formed by the candidate keyword and other candidate keywords respectively according to the sequence of semantic similarity from high to low, selecting the first N word pairs, and establishing edges between two vertexes corresponding to each word pair respectively to construct an undirected graph.
Here N is the window size for selecting candidate word pairs. Its value may be a preset positive integer, generally chosen empirically, or derived from a proportion, for example 30% of the total number of candidate keywords, i.e., of the number of candidate keywords making up the graph model. An undirected graph is a graph model in which an edge between two vertices represents only correlation, without direction.
Each candidate keyword is taken as a vertex, and the edges between vertices in the graph model are determined by the relationship between the vertices, namely the semantic similarity. The semantic similarity reflects how strongly a word pair formed from candidate keywords is associated in a short text: the higher the semantic similarity, the higher the probability that the two candidate keywords co-occur in a text. Therefore the first N word pairs are selected, and a line or curve is drawn between the vertices corresponding to each pair as an edge, constructing the undirected graph.
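The per-keyword selection of the first N word pairs might be sketched as follows; the keywords and similarity scores are illustrative stand-ins, not output of a real word-vector model:

```python
def build_graph(keywords, sim, n):
    """Keep, for each keyword, its n most similar partners; add an
    undirected edge (a frozenset pair) for every retained word pair."""
    edges = set()
    for word in keywords:
        partners = sorted((k for k in keywords if k != word),
                          key=lambda k: sim[frozenset((word, k))],
                          reverse=True)
        for k in partners[:n]:
            edges.add(frozenset((word, k)))
    return edges

keywords = ["hood", "ventilator", "european", "large"]
sim = {frozenset(pair): score for pair, score in [
    (("hood", "ventilator"), 0.95),
    (("hood", "european"), 0.30),
    (("hood", "large"), 0.20),
    (("ventilator", "european"), 0.25),
    (("ventilator", "large"), 0.15),
    (("european", "large"), 0.40),
]}
graph = build_graph(keywords, sim, n=1)
print(sorted(sorted(edge) for edge in graph))  # [['european', 'large'], ['hood', 'ventilator']]
```

Note that with a small N the resulting graph can be disconnected, as in this toy run, which is exactly the situation the connectivity check described next guards against.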
After the graph model is constructed, in order to determine the optimal window size, i.e., the optimal N value, of the candidate keyword selection, iterative computation may be performed according to the properties of the graph model to obtain the value of N. In an optional implementation manner, if the graph model is an undirected graph, the method for determining the value of N may be implemented by:
judging whether the undirected graph is a connected graph or not;
if the undirected graph is a connected graph, the value of N is unchanged, and step S130 is executed;
if the undirected graph is not a connected graph, increasing N, and reestablishing the undirected graph until the undirected graph is the connected graph.
A connected graph is an undirected graph in which any two vertices are linked by some path. When the undirected graph is connected, the corresponding candidate keywords are all related to one another, directly or transitively, and this connection relationship reflects the likelihood that the candidate keywords together make up the subject text of the object to be queried.
Whether the undirected graph is connected can be judged by a traversal search, for example DFS (Depth-First Search): starting from any vertex of the undirected graph, search its adjacent vertices, mark each vertex as it is found, and recurse over the vertices to determine whether they can all reach one another. If they can, the undirected graph is connected; otherwise it is not, in which case N is increased and the undirected graph rebuilt until it is connected. In particular, when the window size N equals the number of candidate keywords, the graph is necessarily connected.
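A stdlib-only sketch of the DFS connectivity check and the "increase N and rebuild" loop follows; the `top_n_edges` builder is a deliberately trivial, made-up stand-in for the word-pair selection step:

```python
def is_connected(vertices, edges):
    """DFS from an arbitrary start vertex; connected iff every vertex is reached."""
    if not vertices:
        return True
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [next(iter(vertices))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adjacency[v] - seen)
    return seen == set(vertices)

# `top_n_edges` is a made-up stand-in for the word-pair selection step;
# here it simply takes the first n links of a chain.
def top_n_edges(vertices, n):
    chain = [(vertices[i], vertices[i + 1]) for i in range(len(vertices) - 1)]
    return chain[:n]

vertices = ["a", "b", "c", "d"]
n = 1
while not is_connected(vertices, top_n_edges(vertices, n)):
    n += 1  # graph not connected: enlarge the window and rebuild
print(n)  # with this toy builder the chain first connects at n = 3
```

The loop is guaranteed to terminate because, as noted above, the graph is necessarily connected once the window covers all candidate keywords.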
And S130, determining the weight of the vertex in the graph model based on an iterative algorithm of the vertex weight.
A vertex weight represents the importance, in the subject text of the object to be queried, of the vertex, i.e., of its corresponding candidate keyword; the higher the weight, the more important the candidate keyword. The vertex-weight iterative algorithm is a method of computing vertex weights that iterates until the weights satisfy a preset convergence condition. When the convergence condition is met, the weights produced by the algorithm tend toward stable values, so each vertex weight obtained at that point can be used as the weight of the corresponding candidate keyword.
The weight of each vertex in the graph model is computed according to the vertex-weight iterative algorithm, and whether the weights meet the convergence condition is checked. If so, the computation stops, and these weights are the final weights of the vertices in the graph model; if not, the weights are updated and recomputed until the convergence condition is met.
In an alternative embodiment, step S130 may be implemented by:
acquiring initial weight of each vertex in the graph model;
adopting a text sorting algorithm, taking the semantic similarity between the candidate keywords corresponding to any two vertices as the weight of the edge between the two vertices, iteratively updating the weight of each vertex in the graph model until convergence, and determining the weight of each vertex.
Here, the initial weight of a vertex is the vertex weight used in the first round of the text sorting algorithm; the text sorting algorithm is a method for calculating the importance of texts or words and ranking them by importance so as to obtain the key text content.
In an alternative embodiment, the graph model may be an undirected graph, and the initial weight of a vertex may be determined according to the number of the vertex, for example: if the graph model is an undirected graph and x vertices are shared in the graph model, the initial weight of each vertex can be set to 1/x.
In an optional implementation, the text sorting algorithm may be the TextRank algorithm (a graph-based text ranking algorithm). The method takes the candidate keywords as vertices, builds a graph model, and iteratively computes the vertex weights in it to obtain the text keywords. Specifically, after the graph model of the candidate keywords is built, the weight of the vertex corresponding to each candidate keyword may be computed iteratively with the following formula:
Weight(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] × Weight(V_j)
wherein:
V_i and V_j denote the vertices i and j formed by candidate keywords;
d is the damping coefficient, with values between 0 and 1, representing the probability of moving from one vertex to any other vertex in the graph model; it is commonly set to 0.85;
In(V_i) is the set of vertices pointing to vertex i;
Out(V_j) is the set of vertices pointed to by vertex j;
in an undirected graph, since edges have no direction, In(V_i) refers to the set of all vertices connected to vertex i, and Out(V_j) to the set of all vertices connected to vertex j; when i equals j, In(V_i) and Out(V_j) are the same set;
w_ji and w_jk denote the weights of the edges between vertices j and i and between vertices j and k, respectively; in this exemplary embodiment the semantic similarity between the candidate keywords corresponding to vertices i and j may be used as the weight of edge (i, j);
Weight(V_i) and Weight(V_j) denote the weights of vertices i and j, that is, of the corresponding candidate keywords.
When the vertex weights obtained by the above formula satisfy a convergence condition, for example when the number of iterations reaches a threshold, or when, for every vertex, the difference between the weights obtained in two consecutive iterations is smaller than a preset value, the iteration stops, and the vertex weights obtained at that point are the final weights of the vertices.
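A compact sketch of the iteration described by the formula above, on an undirected graph whose edge weights are made-up semantic similarities; each vertex starts with initial weight 1/x as in the alternative embodiment, and iteration stops when consecutive weights differ by less than a tolerance:

```python
def textrank(vertices, edge_weight, d=0.85, tol=1e-6, max_iter=200):
    """Iterate Weight(Vi) = (1 - d) + d * sum_j [w_ji / sum_k w_jk] * Weight(Vj)
    over an undirected graph until the weights stabilize."""
    weight = {v: 1.0 / len(vertices) for v in vertices}  # initial weight 1/x
    neighbors = {v: [u for u in vertices if frozenset((u, v)) in edge_weight]
                 for v in vertices}
    for _ in range(max_iter):
        new = {}
        for i in vertices:
            total = sum(
                edge_weight[frozenset((j, i))]
                / sum(edge_weight[frozenset((j, k))] for k in neighbors[j])
                * weight[j]
                for j in neighbors[i])
            new[i] = (1 - d) + d * total
        converged = max(abs(new[v] - weight[v]) for v in vertices) < tol
        weight = new
        if converged:
            break
    return weight

# Toy graph: "hood" is strongly tied to both other keywords, so it
# should end up with the largest weight.
edge_weight = {frozenset(("hood", "ventilator")): 0.9,
               frozenset(("hood", "european")): 0.6,
               frozenset(("ventilator", "european")): 0.2}
w = textrank(["hood", "ventilator", "european"], edge_weight)
print(max(w, key=w.get))  # hood
```

Because the undirected graph has no edge directions, the code treats In(V_i) and Out(V_j) as the same neighbor sets, matching the remark in the description.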
In an alternative embodiment, the above iterative algorithm may produce a sequence of vertex-weight pairs S = [(W1, S1), (W2, S2), …, (Wm, Sm)], where m is the number of vertices, Wm represents the m-th vertex, and Sm represents the weight value of the m-th vertex.
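As an illustrative sketch (hypothetical function and variable names, not the patent's implementation), the iterative weight calculation above can be written in Python for an undirected graph whose edge weights are the semantic similarities between candidate keywords:

```python
def textrank_weights(edges, d=0.85, tol=1e-6, max_iter=100):
    """Iteratively compute vertex weights on a weighted undirected graph.

    edges: dict mapping vertex -> {neighbor: edge_weight}, where the edge
    weight is the semantic similarity between the two candidate keywords.
    """
    weights = {v: 1.0 for v in edges}  # initial weight of each vertex
    for _ in range(max_iter):
        new_weights = {}
        for vi in edges:
            # sum over all vertices Vj connected to Vi (In(Vi) in an undirected graph)
            rank = 0.0
            for vj, w_ji in edges[vi].items():
                out_sum = sum(edges[vj].values())  # sum of wjk over Out(Vj)
                if out_sum > 0:
                    rank += w_ji / out_sum * weights[vj]
            new_weights[vi] = (1 - d) + d * rank
        # convergence: every vertex weight changes by less than tol
        converged = all(abs(new_weights[v] - weights[v]) < tol for v in edges)
        weights = new_weights
        if converged:
            break
    return weights
```

Here each vertex starts with an initial weight of 1.0; the (1 − d) term guarantees that every vertex keeps a minimum weight even if its neighbors contribute little.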
And S140, determining target keywords related to the object to be queried from the candidate keywords corresponding to the vertex according to the weight of the vertex.
The target keywords are keywords of the subject text of the object to be queried, and may be composed of part or all of the candidate keywords.
After the weight of each vertex in the graph model is obtained, the weights satisfying a preset condition are determined, and the candidate keywords corresponding to those vertices in the graph model are taken as the keywords of the subject text of the object to be queried, namely the target keywords.
In an alternative embodiment, step S140 may be implemented by:
sorting the vertices in descending order of weight, and determining the candidate keywords corresponding to the first K vertices as the target keywords related to the object to be queried;
where K is a preset positive integer, and may generally be determined according to the number of candidate keywords; for example, K may be set to the number of keywords corresponding to 20% of the total number of candidate keywords, which is then the number of target keywords. Taking the weight value sequence S = [(W1, S1), (W2, S2), …, (Wm, Sm)] as an example, the vertex-weight pairs in the sequence may be arranged in descending order of weight value, the pairs corresponding to the first K weight values selected, and the corresponding candidate keywords obtained according to the vertex sequence numbers of those pairs.
The weight represents the importance degree (also called the influence factor) of the corresponding candidate keyword in the subject text of the object to be queried: the higher the weight value, the more important the corresponding candidate keyword, and the lower the weight value, the less important it is. According to the preset number K of keywords to select, the first K weights after sorting in descending order are taken, and their corresponding candidate keywords are the keywords of the subject text of the object to be queried.
In an optional implementation manner, the method for selecting the candidate keyword corresponding to the vertex according to the weight may be further implemented by:
presetting a weight threshold, which may be obtained from keyword statistics of texts similar in content or type to the subject text of the object to be queried;
and determining the candidate keywords corresponding to the vertexes with the weight values larger than the weight threshold value as target keywords related to the object to be inquired.
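Both selection strategies, ranking by weight and thresholding, can be sketched as follows (hypothetical names and example values):

```python
def top_k_keywords(weights, k):
    """Return the k candidate keywords with the highest vertex weights."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def keywords_above_threshold(weights, threshold):
    """Return candidate keywords whose vertex weight exceeds a preset threshold."""
    return [word for word, w in weights.items() if w > threshold]
```

In the top-K strategy, K may for example be set to 20% of the number of candidate keywords; in the thresholding strategy, the threshold is derived from statistics over similar texts.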
And S150, when a query request containing the target keyword is received, adding the object to be queried to a query result of the query request.
The query request may be a request initiated by a user to obtain information of a specific object through a query manner, for example: in e-commerce applications, a user may obtain relevant commodity information by inputting a name of an object to be obtained as a query request.
Since the types and the number of the objects to be queried are often very large, and the query result may include one or more objects, in an alternative embodiment, when a query request including a target keyword is received, the objects to be queried may be added to the query result of the query request, that is, the objects to be queried are returned to the query requester as a part of the query result.
In step S120, a graph model for determining the target keyword is constructed by calculating semantic similarity of the candidate keywords, where the semantic similarity may represent association between the candidate keywords, and specifically, in an alternative embodiment, the semantic similarity may be obtained by:
obtaining semantic vectors of the candidate keywords by using a word vector model;
and calculating the similarity of any two semantic vectors to serve as the semantic similarity between the candidate keywords corresponding to the two semantic vectors.
The word vector model is a language model that converts words in natural language into vectors and mines the features between words and sentences in text by machine learning methods; a semantic vector is a vector in a semantic space. According to Harris's distributional hypothesis, words appearing in similar contexts usually have similar semantics, so converting words into semantic vectors makes it possible to determine the similarities and differences between texts numerically, which is an effective text processing method.
The characters of the candidate keywords are converted into numerical values as the input of the word vector model, and the word vector model determines the probability distribution of the input vectors, thereby determining the confidence of text composition patterns, such as the probability value or probability range of a word-to-word connection. Due to the complexity of text composition, the input vector of a candidate keyword in the word vector model generally has a high dimension; after conversion by the word vector model, it can be reduced to a semantic vector of lower dimension.
After the semantic vectors are obtained, the similarity between any two semantic vectors can be calculated to serve as the semantic similarity between corresponding candidate keywords, and the similarity between the two semantic vectors can be determined through cosine similarity, Euclidean distance, correlation coefficients and the like between the vectors.
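For example, the cosine-similarity option can be sketched as follows (hypothetical function name; semantic vectors represented as lists of floats):

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; define similarity as 0
    return dot / (norm_u * norm_v)
```

Euclidean distance or a correlation coefficient could be substituted here with the same interface.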
In an alternative embodiment, the word vector model may be a Word2vec (Word to Vector) model, which predicts a target word according to the context of a text and includes an input layer, a hidden layer, and an output layer. In an exemplary embodiment of the present disclosure, a low-dimensional word vector of a candidate keyword, that is, a semantic vector, may be obtained using this model. Specifically, the method for obtaining the semantic vector of a candidate keyword from the Word2vec model may include the following steps:
(1) The candidate keywords are encoded by one-hot coding, in which a specific word is represented by a vector with a single bit set to 1 and all other bits set to 0; the encoded candidate keyword vector serves as the input vector of the input layer;
(2) In the hidden layer, the input vector is linearly transformed by the weight matrix obtained from the word vector model, yielding a low-dimensional vector corresponding to the candidate keyword vector, namely the semantic vector.
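Steps (1) and (2) can be sketched as follows (a hypothetical 3-word vocabulary with an illustrative 3 × 2 weight matrix). Because the input is one-hot, the linear transform amounts to selecting a single row of the weight matrix:

```python
def one_hot(index, vocab_size):
    """Encode the word at position `index` as a one-hot vector."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def hidden_layer(one_hot_vec, weight_matrix):
    """Linear transform of the input layer: multiplying a one-hot vector by
    the m x n hidden-layer weight matrix selects one row, which is the
    low-dimensional semantic vector of the word."""
    n = len(weight_matrix[0])
    return [sum(x * row[j] for x, row in zip(one_hot_vec, weight_matrix))
            for j in range(n)]
```

This is why the hidden-layer weight matrix of a trained model can be read directly as a table of semantic vectors, one row per vocabulary word.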
In an alternative embodiment, the word vector model may include a number of parameters, such as the weight matrix of the hidden layer, the number of neurons in each layer, and the model learning rate; a particular assignment of these parameters determines a particular word vector model. Specifically, the word vector model can be obtained through the following steps S210 to S230:
and S210, acquiring a general corpus and a target scene corpus.
A corpus is a text library assembled for a research purpose, and may include characters, words, sentences, paragraphs, and the like; the general corpus is a data set covering corpora of multiple domains for general text study, and the target scene corpus is a special corpus designed for a specific purpose or domain.
And S220, performing word segmentation on the texts in the general corpus set and the texts in the target scene corpus set respectively to obtain a general word bank and a target scene word bank.
And when the general corpus and the target scene corpus contain texts, performing word segmentation processing on the texts respectively to obtain a general word bank and a target scene word bank.
And S230, training according to the general word bank and the target scene word bank and obtaining a word vector model.
The general word bank and the target scene word bank are respectively converted into vectors, which serve as the input and output of the word vector model. An objective function is then established: the objective function computes the probability of the samples under a probability model, and a statistical method is used to maximize this probability for the word vector model. For example, using maximum likelihood estimation, a probability distribution of the word vectors is assumed and the probability value is computed; the model parameters at the point where the probability of the word vectors is maximized determine the trained word vector model.
Taking the Skip-Gram model as an example, this model is a simple neural network in Word2vec that can be used to predict the neighbors of a word, that is, the other words appearing before and after it. The input vector obtained by encoding a candidate keyword is taken as input; the weight matrix of the hidden layer, such as an m × n matrix where m corresponds to the input vector of each candidate keyword and n to the neural units, produces a 1 × n word vector, which is passed to the output layer; the activation function of the output layer then yields an output vector for each word vector, where each output vector corresponds to a predicted word. After training is completed, the weight matrix of the Skip-Gram model is obtained, and the semantic vector corresponding to a candidate keyword is read from this weight matrix.
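A minimal sketch of this forward computation (pure Python, hypothetical tiny dimensions; a trained model would learn W_in and W_out from the corpora rather than use fixed values):

```python
from math import exp

def skipgram_forward(word_index, W_in, W_out):
    """One forward pass of a minimal Skip-Gram network.

    W_in:  m x n input-to-hidden weight matrix (m = vocabulary size,
           n = hidden units); row `word_index` is the word's semantic vector.
    W_out: n x m hidden-to-output weight matrix.
    Returns the softmax probability of each vocabulary word appearing
    in the context of the input word.
    """
    hidden = W_in[word_index]  # one-hot input reduces to a row lookup
    scores = [sum(h * W_out[i][j] for i, h in enumerate(hidden))
              for j in range(len(W_in))]  # one score per vocabulary word
    exps = [exp(s) for s in scores]       # softmax activation at the output layer
    total = sum(exps)
    return [e / total for e in exps]
```

Training adjusts W_in and W_out so that actual context words receive high probability; afterwards W_in supplies the semantic vectors.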
In summary, the exemplary embodiments of the present disclosure provide an object query method based on keyword extraction: the method obtains candidate keywords by segmenting the subject text of an object to be queried, establishes a graph model according to the semantic similarity between the candidate keywords, determines the vertex weights by a graph-model vertex-weight calculation, thereby determines the target keywords, and performs queries by the target keywords to obtain information about the object to be queried. Compared with the prior art, in which text keywords are extracted by word frequency and other statistical methods for object query, this method increases the semantic understanding of the words in a text, determines the degree of association between words and thus the importance of words in the text, improves the accuracy of text keyword extraction, and thereby improves the accuracy of querying object information by keywords.
Fig. 3 shows another flow of an exemplary embodiment of the present disclosure, which may include the following steps S310 to S370:
and S310, performing word segmentation on the text of the theme of the object to be queried to obtain candidate keywords.
And S320, calculating semantic similarity among the candidate keywords by adopting a Word2vec model.
Determining a word vector model for calculating the semantic similarity of the candidate keywords through the general corpus and the target scene corpus, calculating the semantic vectors of the candidate keywords according to the word vector model, calculating the similarity of the semantic vectors, and taking the similarity as the semantic similarity of the candidate keywords.
And S330, selecting candidate keywords corresponding to the first N semantic similarities, and establishing a graph model.
The connection relations between corresponding vertices are determined according to semantic similarity: the vertices whose pairwise semantic similarities rank in the top N are selected as the vertices of the graph model, edges are established between the vertices of each such pair, and the semantic similarity value is taken as the weight of the edge, thereby establishing the graph model.
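Step S330 can be sketched as follows (hypothetical names; here the top-N selection is applied per candidate keyword, as in the undirected-graph embodiment, and the similarity function is assumed to be given):

```python
def build_graph(words, similarity, n):
    """Build an undirected graph: for each candidate keyword, keep the n
    word pairs with the highest semantic similarity and create an edge for
    each kept pair, weighted by that similarity.

    similarity: function (word_a, word_b) -> semantic similarity score.
    Returns dict vertex -> {neighbor: edge_weight}.
    """
    graph = {w: {} for w in words}
    for w in words:
        pairs = [(other, similarity(w, other)) for other in words if other != w]
        pairs.sort(key=lambda p: p[1], reverse=True)
        for other, sim in pairs[:n]:   # top-n pairs for this keyword
            graph[w][other] = sim      # undirected: record both directions
            graph[other][w] = sim
    return graph
```

The resulting adjacency structure feeds directly into the iterative weight calculation of step S350.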
And S340, judging whether the graph model is a connected graph or not.
If the graph model is a connected graph, step S350 is executed; if it is not, N is increased and step S330 is executed again, repeating the judgment until the graph model is connected.
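The connectivity judgment of step S340 can be sketched with a breadth-first search (hypothetical names; the graph is given as a dict mapping each vertex to its set of neighbors):

```python
from collections import deque

def is_connected(edges):
    """Breadth-first search connectivity test on an undirected graph."""
    if not edges:
        return True
    start = next(iter(edges))
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for nb in edges[v]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    # connected iff the search reached every vertex
    return len(seen) == len(edges)
```

If `is_connected` returns False, N is increased and the graph is rebuilt, as described above.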
And S350, calculating the vertex weight by adopting an iterative algorithm of the vertex weight until a convergence condition is met.
A graph ranking algorithm is adopted to iteratively calculate the vertex weights; the weights obtained when the convergence condition is satisfied are the weights corresponding to the candidate keywords.
S360, the weights are sorted from high to low, vertexes corresponding to the first K weights are selected, candidate keywords corresponding to the vertexes are determined as target keywords related to the object to be queried, and the target keywords can embody core semantics of a subject text of the object to be queried in a corpus scene.
Step S370, when a query request containing the target keywords is received, adding the object to be queried to a query result of the query request.
Further, an exemplary embodiment of the present disclosure also provides an object query apparatus based on keyword extraction, and as shown in fig. 4, the object query apparatus 400 may include: the processing module 410 is configured to perform word segmentation on the subject text of the object to be queried to obtain a plurality of candidate keywords; the establishing module 420 is configured to establish a graph model by using one or more candidate keywords as vertices according to semantic similarity between the candidate keywords; a calculating module 430, configured to determine a weight of a vertex in the graph model based on an iterative algorithm of vertex weights; the determining module 440 is configured to determine, according to the weight of the vertex, a target keyword related to the object to be queried from the candidate keywords corresponding to the vertex; the adding module 450 is configured to, when a query request including the target keyword is received, add the object to be queried to a query result of the query request.
In an exemplary embodiment of the disclosure, the establishing module 420 may further include: the semantic vector unit is used for obtaining semantic vectors of the candidate keywords by utilizing the word vector model; and the calculating unit is used for calculating the similarity of any two semantic vectors to serve as the semantic similarity between the candidate keywords corresponding to the two semantic vectors.
In an exemplary embodiment of the present disclosure, the semantic vector unit may further include: the obtaining subunit is used for obtaining the general corpus and the target scene corpus; the word segmentation subunit is used for respectively segmenting the texts in the general corpus set and the texts in the target scene corpus set to obtain a general word bank and a target scene word bank; and the model subunit trains according to the general word stock and the target scene word stock and obtains a word vector model.
In an exemplary embodiment of the present disclosure, a graph model includes an undirected graph; the building module 420 may be configured to treat each candidate keyword as a vertex; for each candidate keyword, arranging word pairs formed by the candidate keyword and other candidate keywords respectively according to the sequence of semantic similarity from high to low, selecting the first N word pairs, and establishing edges between two vertexes corresponding to each word pair respectively to construct an undirected graph; wherein N is a preset positive integer.
In an exemplary embodiment of the present disclosure, after constructing the undirected graph, the calculating module 430 may further include a determining unit, configured to determine whether the undirected graph is a connected graph; if the undirected graph is a connected graph, executing an iterative algorithm based on vertex weight, and determining the weight of the vertex in the undirected graph; if the undirected graph is not the connected graph, increasing N, and reestablishing the undirected graph until the undirected graph is the connected graph.
In an exemplary embodiment of the disclosure, the calculation module 430 may be further configured to obtain an initial weight of each vertex in the graph model; and (3) adopting a text sorting algorithm, taking semantic similarity between candidate keywords corresponding to any two vertexes as the edge weight between the two vertexes, iteratively updating the weight of each vertex in the graph model until convergence, and determining the weight of each vertex.
In an exemplary embodiment of the disclosure, the determining module 440 may be configured to rank the vertices according to a sequence from high to low of the weight, and determine candidate keywords corresponding to the first K vertices as target keywords related to the object to be queried; wherein K is a preset positive integer.
In an exemplary embodiment of the present disclosure, an object to be queried may include a commodity; the subject text may include any one or more of: title of the goods, name of the goods, brief introduction of the goods.
The specific details of each module in the above apparatus have been described in detail in the method section, and details of an undisclosed scheme may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
Referring to fig. 5, a program product 500 for implementing the above method according to an exemplary embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Program product 500 may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The exemplary embodiment of the present disclosure also provides an electronic device capable of implementing the above method. An electronic device 600 according to this exemplary embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may take the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, a bus 630 connecting the various system components (including the memory unit 620 and the processing unit 610), and a display unit 640.
Wherein the storage unit 620 stores program code that may be executed by the processing unit 610, such that the processing unit 610 performs the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification. For example, processing unit 610 may perform the method steps shown in fig. 1-3, and so on.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM)621 and/or a cache memory unit 622, and may further include a read only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, according to exemplary embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the exemplary embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the exemplary embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (11)

1. An object query method based on keyword extraction is characterized by comprising the following steps:
performing word segmentation on a subject text of an object to be queried to obtain a plurality of candidate keywords;
according to the semantic similarity among the candidate keywords, establishing a graph model by taking one or more candidate keywords as vertexes;
determining the weight of the vertex in the graph model based on an iterative algorithm of vertex weight;
determining a target keyword related to the object to be queried from the candidate keywords corresponding to the vertex according to the weight of the vertex;
and when a query request containing the target keyword is received, adding the object to be queried to a query result of the query request.
2. The object query method of claim 1, wherein the semantic similarity between the candidate keywords is obtained by:
obtaining semantic vectors of the candidate keywords by using a word vector model;
and calculating the similarity of any two semantic vectors to serve as the semantic similarity between the candidate keywords corresponding to the two semantic vectors.
3. The object query method of claim 2, wherein the word vector model is obtained by:
acquiring a general corpus and a target scene corpus;
respectively segmenting the texts in the general corpus set and the texts in the target scene corpus set to obtain a general word bank and a target scene word bank;
and training according to the general word stock and the target scene word stock to obtain the word vector model.
4. The object query method of claim 1, wherein the graph model comprises an undirected graph; the establishing of the graph model by taking one or more candidate keywords as vertexes according to the semantic similarity among the candidate keywords comprises the following steps:
taking each candidate keyword as a vertex;
for each candidate keyword, arranging word pairs formed by the candidate keyword and other candidate keywords respectively according to the sequence of the semantic similarity from high to low, selecting the first N word pairs, and establishing an edge between two vertexes corresponding to each word pair respectively to construct the undirected graph;
wherein N is a preset positive integer.
5. The object query method of claim 4, wherein after constructing the undirected graph, the method further comprises:
judging whether the undirected graph is a connected graph or not;
if the undirected graph is a connected graph, executing the iterative algorithm based on the vertex weight, and determining the weight of the vertex in the undirected graph;
if the undirected graph is not a connected graph, increasing N, and reestablishing the undirected graph until the undirected graph is the connected graph.
6. The object query method of claim 1, wherein the vertex weight-based iterative algorithm, determining the weight of the vertex in the graph model comprises:
acquiring initial weight of each vertex in the graph model;
and adopting a text sorting algorithm, taking semantic similarity between candidate keywords corresponding to any two vertexes as the edge weight between the two vertexes, iteratively updating the weight of each vertex in the graph model until convergence, and determining the weight of each vertex.
7. The object query method according to claim 1, wherein the determining, according to the weight of the vertex, a target keyword related to the object to be queried from the candidate keywords corresponding to the vertex comprises:
the vertexes are ranked according to the sequence of the weights from high to low, and the candidate keywords corresponding to the first K vertexes are determined as the target keywords related to the object to be inquired;
wherein K is a preset positive integer.
8. The object query method according to claim 1, wherein the object to be queried comprises a commodity; the subject text includes any one or more of: title of the goods, name of the goods, brief introduction of the goods.
9. An object query apparatus based on keyword extraction, the apparatus comprising:
a processing module for performing word segmentation on the subject text of an object to be queried to obtain a plurality of candidate keywords;
an establishing module for establishing a graph model with one or more of the candidate keywords as vertices according to the semantic similarity between the candidate keywords;
a calculation module for determining the weights of the vertices in the graph model based on a vertex-weight-based iterative algorithm;
a determining module for determining, according to the weights of the vertices, target keywords related to the object to be queried from the candidate keywords corresponding to the vertices;
and an adding module for adding the object to be queried to the query result of a query request when a query request containing a target keyword is received.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1 to 8.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-8 via execution of the executable instructions.
CN201911120133.1A 2019-11-15 2019-11-15 Object query method, device, medium and equipment based on keyword extraction Pending CN112818091A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911120133.1A CN112818091A (en) 2019-11-15 2019-11-15 Object query method, device, medium and equipment based on keyword extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911120133.1A CN112818091A (en) 2019-11-15 2019-11-15 Object query method, device, medium and equipment based on keyword extraction

Publications (1)

Publication Number Publication Date
CN112818091A true CN112818091A (en) 2021-05-18

Family

ID=75852909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911120133.1A Pending CN112818091A (en) 2019-11-15 2019-11-15 Object query method, device, medium and equipment based on keyword extraction

Country Status (1)

Country Link
CN (1) CN112818091A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656429A (en) * 2021-07-28 2021-11-16 广州荔支网络技术有限公司 Keyword extraction method and device, computer equipment and storage medium
CN113688628A (en) * 2021-07-28 2021-11-23 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium
CN113688628B (en) * 2021-07-28 2023-09-22 上海携宁计算机科技股份有限公司 Text recognition method, electronic device, and computer-readable storage medium
CN117112916A (en) * 2023-10-25 2023-11-24 蓝色火焰科技成都有限公司 Financial information query method, device and storage medium based on Internet of vehicles

Similar Documents

Publication Publication Date Title
CN109284357B (en) Man-machine conversation method, device, electronic equipment and computer readable medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
CN110598078B (en) Data retrieval method and device, computer-readable storage medium and electronic device
WO2020244065A1 (en) Character vector definition method, apparatus and device based on artificial intelligence, and storage medium
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
US20200192921A1 (en) Suggesting text in an electronic document
JP7417679B2 (en) Information extraction methods, devices, electronic devices and storage media
US11461613B2 (en) Method and apparatus for multi-document question answering
CN112818091A (en) Object query method, device, medium and equipment based on keyword extraction
CN111414561B (en) Method and device for presenting information
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN111753082A (en) Text classification method and device based on comment data, equipment and medium
CN114091425A (en) Medical entity alignment method and device
CN111078842A (en) Method, device, server and storage medium for determining query result
CN111274822A (en) Semantic matching method, device, equipment and storage medium
CN114861889A (en) Deep learning model training method, target object detection method and device
CN112528654A (en) Natural language processing method and device and electronic equipment
CN109977292B (en) Search method, search device, computing equipment and computer-readable storage medium
CN111241273A (en) Text data classification method and device, electronic equipment and computer readable medium
CN113821588A (en) Text processing method and device, electronic equipment and storage medium
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN113076758A (en) Task-oriented dialog-oriented multi-domain request type intention identification method
CN112559711A (en) Synonymous text prompting method and device and electronic equipment
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN111738791A (en) Text processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination