CN114329225A

CN114329225A - Search method, device, equipment and storage medium based on search statement

Info

Publication number: CN114329225A
Application number: CN202210081578.9A
Authority: CN
Inventors: 邹若奇
Original assignee: Ping An International Smart City Technology Co Ltd
Current assignee: Ping An International Smart City Technology Co Ltd
Priority date: 2022-01-24
Filing date: 2022-01-24
Publication date: 2022-04-12
Anticipated expiration: 2042-01-24
Also published as: CN114329225B

Abstract

The invention relates to the field of big data and discloses a search method, a search device, search equipment and a storage medium based on search sentences. The method comprises the following steps: acquiring a search sentence and a text data set, segmenting and coding the search sentence and the text data set respectively to obtain at least one search keyword vector and a plurality of text segmentation vectors, and performing named entity identification and semantic role prediction on vectors; calculating text similarity between a search sentence and each text data in a text data set based on a search keyword vector and a text segmentation vector, calculating entity similarity and semantic role similarity based on the results of named entity identification and semantic role prediction respectively, and further calculating global similarity according to the text similarity, the semantic role similarity and the entity similarity; and displaying the webpage links corresponding to the text data in a descending order according to the size of the global similarity. According to the method, the search sentences are subjected to data matching through the text dimension, the entity dimension and the semantic role dimension, and the search is more accurate.

Description

Search method, device, equipment and storage medium based on search statement

Technical Field

The invention relates to the field of big data, in particular to a search method, a search device, search equipment and a storage medium based on search sentences.

Background

With the application and development of big data, how to accurately search data becomes a problem which needs to be solved at present, and in most search engines, the association degree between a data object and a search statement is determined through the statistics of co-occurrence words.

The existing search method based on search sentences has single search dimension, and semantic relation between the search sentences and data objects cannot be reflected, so that the search accuracy is low.

Disclosure of Invention

The invention mainly aims to solve the problem of low accuracy of the existing search method based on search sentences.

The invention provides a search method based on a search statement in a first aspect, which comprises the following steps:

acquiring a search sentence input by a user and a text data set in a preset search resource pool, performing word segmentation on each text data in the search sentence and the text data set respectively, and performing vectorization coding on word segmentation results respectively to obtain at least one search keyword vector and a plurality of text word segmentation vectors respectively;

performing named entity recognition on the at least one search keyword vector and the plurality of text participle vectors;

performing semantic role prediction on the at least one search keyword vector and the plurality of text participle vectors;

calculating a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculating an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity recognition, and calculating a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

calculating at least one of text similarity, semantic role similarity and entity similarity between the search statement and each piece of text data in the text data set based on a preset calculation rule to obtain global similarity between the search statement and each piece of text data in the text data set;

and acquiring the webpage links corresponding to each piece of text data, sorting the webpage links in a descending order according to the global similarity, and outputting and displaying the sorting result on the terminal.

Optionally, in a first implementation manner of the first aspect of the present invention, the performing named entity recognition on the at least one search keyword vector and the plurality of text participle vectors includes:

acquiring a preset initial training data set, and constructing a data set to be identified based on the at least one search keyword vector and the plurality of text word segmentation vectors;

taking the initial training data set as a first round of training data set, and performing a first round of supervised training on a preset named entity recognition model;

carrying out named entity recognition and labeling on the data set to be recognized based on the named entity recognition model after the first round of supervised training to obtain a weakly labeled data set to be recognized;

and extracting a subset from the weakly labeled data set to be recognized obtained in the current round, adding the subset into the initial training data set to obtain a second round of training data set, performing supervised training again on the named entity recognition model after the first round of supervised training based on the second round of training data set, and performing multiple rounds of training until the named entity recognition model is converged, and outputting the results of entity recognition and labeling of the data set to be recognized in the current round.

Optionally, in a second implementation manner of the first aspect of the present invention, the performing a first round of supervised training on a preset named entity recognition model with the initial training dataset as a first round of training dataset includes:

calling a CRF (Conditional Random Field) network in the named entity recognition model to process the first round of training data set to obtain a probability matrix of each sentence in the first round of training data set;

calculating the probability matrix of each sentence based on the Viterbi algorithm to obtain an optimal labeling sequence;

and adjusting parameters of the named entity recognition model according to the recognition result in the optimal labeling sequence and the labeling result in the first round of training data set.

Optionally, in a third implementation manner of the first aspect of the present invention, the performing semantic role prediction on the at least one search keyword vector and the text participle vectors includes:

sequentially performing forward sequence part-of-speech analysis and reverse sequence part-of-speech analysis on a target vector based on a preset part-of-speech analysis model, and determining part-of-speech types of participles corresponding to the target vector according to an analysis result, wherein the target vector comprises the at least one search keyword vector and the plurality of text vectors;

searching part-of-speech vectors of the participles corresponding to the target vectors in a preset part-of-speech vector library according to the part-of-speech types of the participles corresponding to the target vectors;

and sequentially performing forward-order semantic role analysis and reverse-order semantic role analysis on the part-of-speech vectors of the participles corresponding to the target vector based on a preset role analysis model, and determining the semantic role types of the search keyword vectors and the semantic role types of each text participle vector according to the analysis result.

Optionally, in a fourth implementation manner of the first aspect of the present invention, the performing, based on a preset role analysis model, forward order semantic role analysis and reverse order semantic role analysis on part-of-speech vectors of the participles corresponding to the target vector in sequence, and determining, according to an analysis result, a semantic role type of the search keyword vector and a semantic role type of each text participle vector includes:

on the basis of a preset role analysis model, sequentially performing forward-order semantic role analysis and reverse-order semantic role analysis on part-of-speech vectors of the participles corresponding to the target vector to obtain a first output vector and a second output vector corresponding to each part-of-speech vector;

calculating the first output vector and the second output vector corresponding to each part-of-speech vector according to a preset probability function to obtain a semantic role probability vector of the participle corresponding to each part-of-speech vector;

processing the semantic role probability vector of the participle corresponding to each part-of-speech vector with the argmax function (arguments of the maxima) to obtain a sequence number representing the semantic role type;

and determining the semantic role type of the search keyword vector and the semantic role type of each text participle vector according to the sequence number for representing the semantic role type.
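The last three steps above (probability function, argmax, mapping the sequence number to a role type) can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the preset probability function is a softmax, that the first and second output vectors are combined by element-wise addition, and that the role inventory is the hypothetical list `ROLE_TYPES`; none of these details is stated in the source.

```python
import math

# Hypothetical role inventory; the source does not enumerate the role types.
ROLE_TYPES = ["agent", "patient", "time", "place"]


def softmax(scores):
    """Convert raw scores into a probability vector."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_role(forward_out, backward_out):
    """Combine the forward and reverse outputs and pick the most probable role."""
    # Assumption: the two output vectors are combined by element-wise addition.
    combined = [f + b for f, b in zip(forward_out, backward_out)]
    probs = softmax(combined)
    # argmax: the index (sequence number) of the largest probability.
    index = max(range(len(probs)), key=probs.__getitem__)
    return index, ROLE_TYPES[index], probs


index, role, probs = predict_role([0.2, 1.5, 0.1, 0.3], [0.1, 1.0, 0.4, 0.2])
print(index, role)
```

The returned index plays the part of the "sequence number for representing the semantic role type"; the lookup into `ROLE_TYPES` is the final mapping step.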

Optionally, in a fifth implementation manner of the first aspect of the present invention, the calculating, based on a preset calculation rule, at least one of a text similarity, a semantic role similarity, and an entity similarity between the search statement and each piece of text data in the text data set, and obtaining a global similarity between the search statement and each piece of text data in the text data set includes:

when the number of the search keyword vectors is within a first preset range, taking the text similarity between the search statement and each piece of text data in the text data set as a global similarity;

and when the number of the search keyword vectors is within a second preset range, multiplying the entity similarity between the search statement and each text data in the text data set by the semantic role similarity to obtain the global similarity between the search statement and each text data in the text data set.

Optionally, in a sixth implementation manner of the first aspect of the present invention, after the obtaining a global similarity between the search statement and each piece of text data in the text data set by multiplying an entity similarity between the search statement and each piece of text data in the text data set by a semantic role similarity when the number of the search keyword vectors is within a second preset range, the method further includes:

when the number of the search keyword vectors is within a third preset range, multiplying the text similarity between the search statement and each piece of text data in the text data set by the semantic role similarity to obtain the global similarity between the search statement and each piece of text data in the text data set;

and when the number of the search keyword vectors is within a fourth preset range, multiplying the text similarity between the search statement and each text data in the text data set by the semantic role similarity and the entity similarity in sequence to obtain the global similarity between the search statement and each text data in the text data set.
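The range-based rule in the fifth and sixth implementation manners can be sketched as a simple dispatch on the number of search keyword vectors. The source does not disclose the four preset ranges, so the boundaries used here (1, 2, 3 to 4, 5 or more) are purely hypothetical placeholders, as is the function name `global_similarity`.

```python
def global_similarity(keyword_count, text_sim, role_sim, entity_sim):
    """Select and multiply similarity components by keyword count.

    The range boundaries below are hypothetical; the source only says
    that four preset ranges exist.
    """
    if keyword_count < 1:
        raise ValueError("a search sentence yields at least one keyword vector")
    if keyword_count == 1:   # first preset range (assumed)
        return text_sim
    if keyword_count == 2:   # second preset range (assumed)
        return entity_sim * role_sim
    if keyword_count <= 4:   # third preset range (assumed)
        return text_sim * role_sim
    # fourth preset range (assumed): all three components
    return text_sim * role_sim * entity_sim


print(global_similarity(5, text_sim=0.5, role_sim=0.4, entity_sim=0.2))
```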

The second aspect of the present invention provides a search apparatus based on a search sentence, including:

the word vector generation module is used for acquiring a search sentence input by a user and a text data set in a preset search resource pool, performing word segmentation on each piece of text data in the search sentence and the text data set respectively, and performing vectorization coding on word segmentation results respectively to obtain at least one search keyword vector and a plurality of text word segmentation vectors respectively;

a named entity recognition module for performing named entity recognition on the at least one search keyword vector and the plurality of text participle vectors;

a semantic role prediction module for performing semantic role prediction on the at least one search keyword vector and the plurality of text participle vectors;

a similarity calculation module, configured to calculate a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculate an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity identification, and calculate a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

the global similarity calculation module is used for calculating at least one of text similarity, semantic role similarity and entity similarity between the search statement and each piece of text data in the text data set based on a preset calculation rule to obtain global similarity between the search statement and each piece of text data in the text data set;

and the visualization module is used for acquiring the webpage links corresponding to each piece of text data, sorting the webpage links in a descending order according to the global similarity, and outputting and displaying the sorting result on the terminal.

Optionally, in a first implementation manner of the second aspect of the present invention, the named entity identifying module specifically includes:

the data set construction unit is used for acquiring a preset initial training data set and constructing a data set to be identified based on the at least one search keyword vector and the text word segmentation vectors;

the supervised training unit is used for carrying out first round of supervised training on a preset named entity recognition model by taking the initial training data set as a first round of training data set;

the recognition and labeling unit is used for recognizing and labeling named entities of the data set to be recognized based on the named entity recognition model after the first round of supervised training to obtain a weakly labeled data set to be recognized;

and the iterative training unit is used for extracting a subset from the weakly labeled data set to be recognized obtained in the current round, adding the subset into the initial training data set to obtain a second round of training data set, performing supervised training on the named entity recognition model after the first round of supervised training again based on the second round of training data set, and performing multiple rounds of training until the named entity recognition model is converged, and outputting the results of entity recognition and labeling of the data set to be recognized in the current round.

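Optionally, in a second implementation manner of the second aspect of the present invention, the supervised training unit is specifically configured to:

call a CRF (Conditional Random Field) network in the named entity recognition model to process the first round of training data set to obtain a probability matrix of each sentence in the first round of training data set;

calculate the probability matrix of each sentence based on the Viterbi algorithm to obtain an optimal labeling sequence;

and adjust parameters of the named entity recognition model according to the recognition result in the optimal labeling sequence and the labeling result in the first round of training data set.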

Optionally, in a third implementation manner of the second aspect of the present invention, the semantic role prediction module specifically includes:

the part-of-speech analysis unit is used for sequentially carrying out forward-sequence part-of-speech analysis and reverse-sequence part-of-speech analysis on the target vector based on a preset part-of-speech analysis model, and determining the part-of-speech type of the participle corresponding to the target vector according to the analysis result, wherein the target vector comprises the at least one search keyword vector and the plurality of text vectors;

the vector obtaining unit is used for searching part-of-speech vectors of the participles corresponding to the target vectors in a preset part-of-speech vector library according to the part-of-speech types of the participles corresponding to the target vectors;

and the role analysis unit is used for sequentially carrying out forward-order semantic role analysis and reverse-order semantic role analysis on the part of speech vectors of the participles corresponding to the target vector based on a preset role analysis model, and determining the semantic role types of the search keyword vectors and the semantic role types of each text participle vector according to the analysis result.

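Optionally, in a fourth implementation manner of the second aspect of the present invention, the role analysis unit is specifically configured to:

sequentially perform, on the basis of a preset role analysis model, forward-order semantic role analysis and reverse-order semantic role analysis on the part-of-speech vectors of the participles corresponding to the target vector to obtain a first output vector and a second output vector corresponding to each part-of-speech vector;

calculate the first output vector and the second output vector corresponding to each part-of-speech vector according to a preset probability function to obtain a semantic role probability vector of the participle corresponding to each part-of-speech vector;

process the semantic role probability vector of the participle corresponding to each part-of-speech vector with the argmax function to obtain a sequence number representing the semantic role type;

and determine the semantic role type of the search keyword vector and the semantic role type of each text participle vector according to the sequence number representing the semantic role type.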

Optionally, in a fifth implementation manner of the second aspect of the present invention, the global similarity calculation module specifically includes:

the first calculation unit is used for taking the text similarity between the search statement and each piece of text data in the text data set as the global similarity when the number of the search keyword vectors is within a first preset range;

and the second calculating unit is used for multiplying the entity similarity between the search statement and each piece of text data in the text data set by the semantic role similarity when the number of the search keyword vectors is within a second preset range, so as to obtain the global similarity between the search statement and each piece of text data in the text data set.

Optionally, in a sixth implementation manner of the second aspect of the present invention, the global similarity calculation module specifically includes:

the second calculation unit is used for multiplying the entity similarity between the search statement and each piece of text data in the text data set by the semantic role similarity when the number of the search keyword vectors is within a second preset range, so as to obtain the global similarity between the search statement and each piece of text data in the text data set;

the third calculating unit is used for multiplying the text similarity between the search statement and each piece of text data in the text data set by the semantic role similarity when the number of the search keyword vectors is within a third preset range, so as to obtain the global similarity between the search statement and each piece of text data in the text data set;

and the fourth calculating unit is used for multiplying the text similarity between the search statement and each piece of text data in the text data set by the semantic role similarity and the entity similarity in sequence when the number of the search keyword vectors is within a fourth preset range, so as to obtain the global similarity between the search statement and each piece of text data in the text data set.

A third aspect of the present invention provides a search apparatus based on a search sentence, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the search statement based search apparatus to perform the search statement based search method described above.

A fourth aspect of the present invention provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to execute the above-described search sentence-based search method.

According to the technical scheme, corresponding vectors are generated by obtaining a search statement and a text data set in a search resource pool, segmenting and vectorizing the text data set, named entity recognition and semantic role prediction are sequentially carried out on the generated vectors, then the text similarity, the entity similarity and the semantic role similarity between the search statement and each piece of data in the text data set are respectively calculated, finally the text similarity, the entity similarity and the semantic role similarity are calculated according to a preset calculation rule, so that the global similarity is obtained, and finally corresponding webpage links in the text data set are displayed in a descending order according to the size of the global similarity. According to the method, the search sentences are subjected to data matching through the text dimension, the entity dimension and the semantic role dimension, and the search is more accurate.

Drawings

FIG. 1 is a schematic diagram of a first embodiment of a search statement-based search method according to an embodiment of the present invention;

FIG. 2 is a diagram of a second embodiment of the search sentence-based search method according to the embodiment of the present invention;

FIG. 3 is a diagram of a third embodiment of a search method based on a search statement in the embodiment of the present invention;

FIG. 4 is a diagram of a fourth embodiment of the search method based on the search statement in the embodiment of the present invention;

FIG. 5 is a schematic diagram of an embodiment of a search apparatus based on a search statement in the embodiment of the present invention;

FIG. 6 is a schematic diagram of another embodiment of a search apparatus based on a search statement in the embodiment of the present invention;

FIG. 7 is a schematic diagram of an embodiment of a search apparatus based on a search statement in the embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a searching method, a searching device, searching equipment and a storage medium based on a searching statement, and the searching result is more accurate.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," or "having," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The server in the invention can be an independent server, and can also be a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and artificial intelligence platform and the like.

For convenience of understanding, a specific flow of the embodiment of the present invention is described below, and referring to fig. 1, an embodiment of a search method based on a search statement in the embodiment of the present invention includes:

101. Acquiring a search sentence input by a user and a text data set in a preset search resource pool, performing word segmentation on each text data in the search sentence and the text data set respectively, and performing vectorization coding on word segmentation results respectively to obtain at least one search keyword vector and a plurality of text word segmentation vectors respectively;

It can be understood that the search resource pools of different search engines may differ. A search resource pool is built by crawling web page content with a web crawler to construct an index database; the server then processes the search sentence input by the user according to a specific matching algorithm to match the corresponding web page texts. The text data in the text data set is the web page data crawled by the web crawler.

Specifically, the server segments the search sentence and each piece of text data in the text data set by calling a word segmentation tool such as Jieba, SnowNLP, or PkuSeg; this embodiment does not limit the choice of tool. Depending on the length of the search sentence, the server divides it into at least one participle and then encodes each participle as a corresponding search keyword vector.

To facilitate computation, the server performs one-hot coding on each participle based on a preset vocabulary, so that the original text representation is converted into a vector representation that the network model can process. For example, the server first extracts 10000 unique words from a training document to form a vocabulary and then one-hot codes these 10000 words. Each word becomes a 10000-dimensional vector whose components are all 0 or 1; if the word "ants" is the 3rd entry in the vocabulary, its vector has a 1 in the third dimension and 0 in every other dimension. As a smaller example, if the search sentence is "The dog barked at the mailman", the server can construct a vocabulary of size 5 (ignoring capitalization and punctuation), ("the", "dog", "barked", "at", "mailman"), and number its words from 0 to 4. Then "dog" can be expressed as the 5-dimensional vector [0,1,0,0,0].
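The one-hot coding of the example sentence can be reproduced with a few lines of Python. This is only a sketch: it tokenizes on whitespace, which is an assumption made for the English example, whereas the embodiment relies on a word segmentation tool; the helper names `build_vocabulary` and `one_hot` are hypothetical.

```python
def build_vocabulary(sentence):
    """Collect unique lowercase words in order of first appearance."""
    words = []
    for token in sentence.lower().split():
        token = token.strip(".,!?;:\"'")  # ignore punctuation
        if token and token not in words:
            words.append(token)
    return words


def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


vocabulary = build_vocabulary("The dog barked at the mailman.")
print(vocabulary)                  # ['the', 'dog', 'barked', 'at', 'mailman']
print(one_hot("dog", vocabulary))  # [0, 1, 0, 0, 0]
```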

102. Performing named entity recognition on at least one search keyword vector and a plurality of text participle vectors;

It will be appreciated that named entity recognition identifies entities with a particular meaning, such as person names, place names, organization names, and proper nouns, in unstructured text. The server performs named entity recognition on the search keyword vectors and the text participle vectors separately: it first recognizes the boundary of an entity and then recognizes the category of that entity. Usable recognition methods include, for example, named entity recognition based on a Conditional Random Field (CRF) and named entity recognition based on multiple features; this embodiment does not limit the method.

103. Performing semantic role prediction on at least one search keyword vector and a plurality of text participle vectors;

It can be understood that semantic role prediction (Semantic Role Labeling, SRL) is one of the core tasks of sentence analysis. The server performs semantic role prediction on the search keyword vectors and the text participle vectors to recover the predicate-argument structure they contain, and thereby makes basic judgments such as "who did what to whom", "when" and "where". Here, a predicate is a word in a sentence that describes or judges the subject, usually a verb; an argument is a noun collocated with the predicate in the sentence; and a semantic role is the role that an argument plays in relation to the verb, such as time, place, agent, patient, object, experiencer, beneficiary, tool, target, or source. Specifically, the server may perform semantic role prediction on the search keyword vectors and the text participle vectors using, for example, a method based on shallow syntactic analysis.

104. Calculating a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculating an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity recognition, and calculating a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

It can be understood that the server measures the similarity between the search sentence and each piece of text data in the text data set from multiple dimensions and takes the similarity in each dimension as a component of the global similarity. The similarity components are the text similarity (i.e., the similarity between characters), the semantic role similarity, and the named entity similarity.

Specifically, the server may calculate the similarity between each search keyword vector a in the search sentence A and each text segmentation vector b in the text data B using a similarity measure such as Euclidean distance, the Pearson correlation coefficient, or cosine similarity; this embodiment does not limit the measure. The server then sums the calculated similarities to obtain the text similarity between the search sentence A and the text data B. For example, if the search keyword vectors corresponding to the search sentence A include a1 and a2, and the text segmentation vectors corresponding to the text data B include b1, b2 and b3, then the text similarity between the search sentence A and the text data B is T_Sim(A,B) = T_Sim(a1,b1) + T_Sim(a1,b2) + T_Sim(a1,b3) + T_Sim(a2,b1) + T_Sim(a2,b2) + T_Sim(a2,b3).
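The pairwise summation can be sketched as follows, choosing cosine similarity from the measures the embodiment lists. The function names and the toy one-hot vectors are illustrative assumptions, and returning 0.0 for a zero-length vector is a choice made here rather than something the source specifies.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def text_similarity(keyword_vectors, text_vectors):
    """Sum the similarity of every (keyword vector, text vector) pair."""
    return sum(
        cosine_similarity(a, b)
        for a in keyword_vectors
        for b in text_vectors
    )


keywords = [[1, 0, 0], [0, 1, 0]]            # a1, a2
text = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # b1, b2, b3
print(text_similarity(keywords, text))       # 2.0
```

With one-hot vectors each pair contributes 1 when the words match and 0 otherwise, so the sum simply counts matching word pairs; dense embeddings would yield graded values.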

Further, please refer to formula one and formula two for the calculation of semantic role similarity and named entity similarity.

Wherein, P_Sim(A,B) is the semantic role similarity between the search sentence A and the text data B, |A| is the total number of part-of-speech sequences in the search sentence A, and P_Seq_A,B is the total number of overlapping part-of-speech occurrences in the search sentence A and the text data B.

Similarly, N_Sim(A,B) is the named entity similarity between the search sentence A and the text data B, |A| is the total number of named entities in the search sentence A, and P_Seq_A,B is the total number of entities whose entity types occur in both the search sentence A and the text data B.
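Formula one and formula two are not reproduced in this text. Judging only from the symbol definitions above, one plausible reading is that each similarity is the overlap count divided by |A|. The sketch below implements that assumed ratio; both the ratio and the multiset way of counting overlaps are assumptions, and `overlap_similarity` is a hypothetical name.

```python
from collections import Counter


def overlap_similarity(query_labels, text_labels):
    """Assumed form: (labels shared by A and B) / (labels in A).

    The labels may be semantic role types or named entity types.
    Shared labels are counted as a multiset intersection, which is an
    assumption rather than the published formula.
    """
    if not query_labels:
        return 0.0
    shared = Counter(query_labels) & Counter(text_labels)
    return sum(shared.values()) / len(query_labels)


# Entity types found in a search sentence A and a text B (toy data).
print(overlap_similarity(["PER", "LOC", "ORG"], ["LOC", "PER", "PER"]))
```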

105. Calculating at least one of text similarity, semantic role similarity and entity similarity between the search statement and each piece of text data in the text data set based on a preset calculation rule to obtain global similarity between the search statement and each piece of text data in the text data set;

It can be understood that the global similarity is calculated from at least one of the text similarity, the semantic role similarity and the entity similarity, selected according to the actual service scenario. When only one similarity is selected, it is used directly as the global similarity; when two or all three are selected, they are multiplied together and the product is used as the global similarity. Preferably, the server selects as many of the similarities as possible, so that the global similarity reflects multiple dimensions; please refer to formula three:

Sim(A,B) = T_Sim(A,B) * P_Sim(A,B) * N_Sim(A,B)    (formula three)

Wherein, T_Sim(A,B) is the text similarity, P_Sim(A,B) is the semantic role similarity, N_Sim(A,B) is the entity similarity, A is the search sentence, and B is the text data.

106. And acquiring the webpage links corresponding to each piece of text data, sorting the webpage links in a descending order according to the global similarity, and outputting and displaying the sorting result on the terminal.

It can be understood that the text data is a static text resource that the server obtains by crawling an original web page with a web crawler. The server then executes a Natural Language Processing (NLP) task with the static text resources as the calculation objects to obtain the global similarity between the search statement and each static text resource, and sorts the web page links of the original web pages corresponding to the static text resources in descending order of global similarity, so that the user first browses the web pages most closely associated with the search statement. When the user clicks a web page link, the user jumps to the corresponding web page content through the Uniform Resource Locator (URL) contained in the web page link element.
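The ranking in step 106 amounts to a descending sort of links by global similarity, as the sketch below shows. The URLs and scores are made-up placeholders, and `rank_links` is a hypothetical helper name.

```python
def rank_links(scored_pages):
    """Return URLs ordered from highest to lowest global similarity.

    scored_pages is a list of (url, global_similarity) pairs.
    """
    ordered = sorted(scored_pages, key=lambda page: page[1], reverse=True)
    return [url for url, _ in ordered]


# Placeholder links and scores for illustration only.
pages = [
    ("https://example.com/a", 0.12),
    ("https://example.com/b", 0.57),
    ("https://example.com/c", 0.31),
]
for url in rank_links(pages):
    print(url)
```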

In the embodiment, data matching is performed on the search sentences through the text dimension, the entity dimension and the semantic role dimension, so that the search is more accurate.

Referring to fig. 2, a second embodiment of the search method based on the search statement according to the embodiment of the present invention includes:

201. acquiring a search sentence input by a user and a text data set in a preset search resource pool, performing word segmentation on each text data in the search sentence and the text data set respectively, and performing vectorization coding on word segmentation results respectively to obtain at least one search keyword vector and a plurality of text word segmentation vectors respectively;

step 201 is similar to the step 101, and is not described herein again.

202. Acquiring a preset initial training data set, and constructing a data set to be identified based on at least one search keyword vector and a plurality of text word segmentation vectors;

It will be appreciated that the initial training data set may be an entity recognition training data set publicly available on the network, covering, for example, phenomenological words, names of people, names of places, etc. in various domains.

203. Taking the initial training data set as a first round of training data set, and performing a first round of supervised training on a preset named entity recognition model;

It can be understood that the server calls the CRF (conditional random field) layer of the named entity recognition model to process the first round of training data set and obtain a probability matrix for each sentence, then decodes the probability matrix of each sentence with the Viterbi algorithm to obtain an optimal labeling sequence, and finally adjusts the parameters of the named entity recognition model according to the recognition result in the optimal labeling sequence and the labeling result in the first round of training data set, thereby completing the first round of supervised training.

Specifically, the server calls the CRF network in the named entity recognition model and, through the conditional random field algorithm, converts each piece of data in the input first round of training data set into a probability matrix that satisfies a Markov random field under the conditional probability. The probability matrix is composed of the label probability sequences of all words in a sentence, and each label probability sequence represents the probability distribution of a word over the entity labels. The Viterbi algorithm is a dynamic programming algorithm, with which the server decodes the probability matrix to determine the optimal label sequence.
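
The Viterbi decoding step can be sketched as follows. This is a generic textbook formulation under stated assumptions, not the patent's implementation: scores are taken to be additive (for example, log-probabilities), `emissions[t][j]` is the score of label j for word t, and `transitions[i][j]` is the score of moving from label i to label j. All names are hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label index sequence for one sentence.

    emissions:   list of per-word score lists, one score per label.
    transitions: square matrix, transitions[i][j] scores label i -> label j.
    Scores are assumed to be additive (e.g. log-probabilities).
    """
    num_labels = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    backptrs = []               # best previous label, per position and label
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptrs)
    # Trace back from the best final label to recover the full path.
    path = [max(range(num_labels), key=lambda j: score[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]
```

The dynamic programming table keeps only the best path into each label at each position, so decoding costs on the order of (sentence length) x (number of labels) squared rather than enumerating every label sequence.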

Further, the server can measure the recognition accuracy of the named entity recognition model by observing the deviation between the actual recognition result and the original labeling result. Preferably, the server quantifies the deviation through a loss function to obtain a corresponding loss value; when the loss value is smaller than a preset threshold value or reaches a minimum value, the current named entity recognition model is considered to have reached its optimal performance.

204. Carrying out named entity recognition and labeling on a data set to be recognized based on a named entity recognition model after the first round of supervised training to obtain a weakly labeled data set to be recognized;

It can be understood that the named entity recognition model after the first round of supervised training can recognize entities roughly and label the corresponding recognition results. Although the accuracy of these results is not high, they are sufficient to weakly label the data set to be recognized, so that the weakly labeled data set can be used for the next round of supervised training of the named entity recognition model, and the samples are updated through self-learning on the data to be recognized.

205. Extracting a subset from the weakly labeled data set to be recognized obtained in the current round, adding the subset into the initial training data set to obtain a second round of training data set, performing supervised training on the named entity recognition model after the first round of supervised training again based on the second round of training data set, and performing multi-round training in the way until the named entity recognition model is converged, and outputting the results of entity recognition and labeling of the data set to be recognized in the current round;

It should be understood that the subset extracted by the server consists of the sentences in the weakly labeled data set to be recognized whose confidence is greater than or equal to a confidence threshold, where the confidence of a sentence is the average probability value of the recognition tags of all words labeled as entities in the sentence. Preferably, the confidence threshold is 0.8. Take the sentence "Tony and Tom are friends" as an example: the recognition tags of its words are, in turn, B-Person, O, B-Person, O, O, and the probability values of these tags are assumed to be 78%, 90%, 88%, 91% and 89% in turn. The average probability value of the recognition tags of the words labeled as entities is then (78% + 88%)/2 = 0.83. Since the sentence confidence 0.83 is greater than the confidence threshold 0.8, the sentence is selected into the subset.
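
The confidence-based subset extraction can be sketched as below. The dictionary layout (`"tags"`, `"probs"`), the function names, and the treatment of a sentence with no entity words (confidence 0.0) are assumptions of this sketch; the tag sequence in the usage note is the reading implied by the five probability values in the example.

```python
def sentence_confidence(tags, probs, outside_tag="O"):
    """Average probability of the tags of words labeled as entities."""
    entity_probs = [p for tag, p in zip(tags, probs) if tag != outside_tag]
    if not entity_probs:
        return 0.0  # assumption: a sentence without entities is never selected
    return sum(entity_probs) / len(entity_probs)


def select_subset(weakly_labeled, threshold=0.8):
    """Keep sentences whose confidence is at least the threshold."""
    return [
        sentence for sentence in weakly_labeled
        if sentence_confidence(sentence["tags"], sentence["probs"]) >= threshold
    ]
```

For the example sentence, `sentence_confidence(["B-Person", "O", "B-Person", "O", "O"], [0.78, 0.90, 0.88, 0.91, 0.89])` averages only 0.78 and 0.88, giving approximately 0.83, so the sentence passes the 0.8 threshold.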

206. Performing semantic role prediction on at least one search keyword vector and a plurality of text participle vectors;

207. calculating a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculating an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity recognition, and calculating a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

208. calculating at least one of text similarity, semantic role similarity and entity similarity between the search statement and each piece of text data in the text data set based on a preset calculation rule to obtain global similarity between the search statement and each piece of text data in the text data set;

209. and acquiring the webpage links corresponding to each piece of text data, sorting the webpage links in a descending order according to the global similarity, and outputting and displaying the sorting result on the terminal.

Wherein, steps 206-209 are similar to steps 103-106 described above, and detailed description thereof is omitted here.

In this embodiment, the process of named entity recognition for the search sentences and the text data is described in detail. By performing multiple rounds of supervised training on the named entity recognition model with both the training samples and the samples to be recognized, that is, by self-learning on the samples to be recognized, the recognition result becomes more accurate.

Referring to fig. 3, a third embodiment of the search method based on the search statement according to the embodiment of the present invention includes:

301. acquiring a search sentence input by a user and a text data set in a preset search resource pool, performing word segmentation on each text data in the search sentence and the text data set respectively, and performing vectorization coding on word segmentation results respectively to obtain at least one search keyword vector and a plurality of text word segmentation vectors respectively;

302. performing named entity recognition on at least one search keyword vector and a plurality of text participle vectors;

wherein, the steps 301-302 are similar to the steps 101-102 described above, and detailed description thereof is omitted here.

303. Sequentially performing forward sequence part-of-speech analysis and reverse sequence part-of-speech analysis on a target vector based on a preset part-of-speech analysis model, and determining part-of-speech types of participles corresponding to the target vector according to an analysis result, wherein the target vector comprises at least one search keyword vector and a plurality of text vectors;

It can be understood that part-of-speech analysis must consider not only the attributes of the word itself but also the dependency relationship between preceding and following words, so the order of analysis has a certain influence on the result. In this embodiment, the server performs part-of-speech analysis in both the forward order and the reverse order to obtain the corresponding part-of-speech probability vectors, assigns a weight to each, and sums the weighted vectors to obtain the final part-of-speech probability vector. Preferably, the weight ratio between the forward and reverse part-of-speech probability vectors is 3:1. Further, the server applies the argmax function (the argument of the maximum) to the final part-of-speech probability vector to obtain the final part-of-speech type number, and thereby determines the part-of-speech type of the participle, such as verb, noun or adjective.
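
The weighted fusion and argmax step can be sketched as follows. Normalizing the preferred 3:1 ratio to weights 0.75 and 0.25 is an assumption of this sketch, as are the function name and the example label list.

```python
POS_TYPES = ["noun", "verb", "adjective"]  # hypothetical label order


def fuse_pos_probabilities(forward_probs, backward_probs,
                           forward_weight=0.75, backward_weight=0.25):
    """Weighted sum of forward and reverse probability vectors, then argmax.

    The default weights encode the preferred 3:1 ratio, normalized to sum
    to 1 (an assumption). Returns (type_number, fused_vector).
    """
    fused = [forward_weight * f + backward_weight * b
             for f, b in zip(forward_probs, backward_probs)]
    type_number = max(range(len(fused)), key=fused.__getitem__)
    return type_number, fused
```

In this sketch the returned type number indexes into a label list such as `POS_TYPES` to recover the part-of-speech type.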

304. Searching part-of-speech vectors of the participles corresponding to the target vectors in a preset part-of-speech vector library according to the part-of-speech types of the participles corresponding to the target vectors;

it can be understood that, in order to perform role analysis calculation using the part of speech type, the server converts the part of speech type into a corresponding part of speech vector, and specifically, the server matches the part of speech type according to a preset part of speech vector library to obtain a corresponding vector representation.

305. Based on a preset role analysis model, sequentially performing forward order semantic role analysis and reverse order semantic role analysis on part-of-speech vectors of the participles corresponding to the target vector, and determining semantic role types of the search keyword vectors and semantic role types of each text participle vector according to the analysis result;

It should be understood that the server first performs forward-order and reverse-order semantic role analysis on the part-of-speech vectors in sequence based on the preset role analysis model, obtaining a first output vector and a second output vector for each part-of-speech vector. It then applies a preset probability function to the first output vector and the second output vector of each part-of-speech vector to obtain the semantic role probability vector of the corresponding participle. Finally, it processes each semantic role probability vector with the argmax operation (the set of arguments of the maximum) to obtain a sequence number representing the semantic role type, thereby determining the semantic role types of the search keyword vectors and of each text participle vector. The semantic role analysis is similar to the part-of-speech analysis: the forward and reverse orders are analyzed jointly to obtain a forward-order and a reverse-order semantic role probability vector, and corresponding weights may likewise be introduced to compute their weighted sum as the semantic role probability vector of the participle. Through argmax, the server obtains the sequence number of the semantic role type at which the semantic role probability vector of the participle takes its maximum value.
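
One possible reading of this step is sketched below. The text does not name the preset probability function; softmax over the element-wise sum of the two output vectors is used here purely as an assumption, and both function names are hypothetical.

```python
import math


def role_probabilities(first_output, second_output):
    """Turn the two output vectors into one semantic role probability vector.

    Assumption: the vectors are summed element-wise and passed through
    softmax; the source only refers to "a preset probability function".
    """
    combined = [a + b for a, b in zip(first_output, second_output)]
    peak = max(combined)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in combined]
    total = sum(exps)
    return [e / total for e in exps]


def role_type_number(first_output, second_output):
    """Argmax over the probability vector: the semantic role sequence number."""
    probs = role_probabilities(first_output, second_output)
    return max(range(len(probs)), key=probs.__getitem__)
```

If per-direction weights are wanted, as the paragraph above allows, the two vectors could be scaled before the sum; that variant is not shown.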

306. Calculating a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculating an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity recognition, and calculating a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

307. calculating at least one of text similarity, semantic role similarity and entity similarity between the search statement and each piece of text data in the text data set based on a preset calculation rule to obtain global similarity between the search statement and each piece of text data in the text data set;

308. and acquiring the webpage links corresponding to each piece of text data, sorting the webpage links in a descending order according to the global similarity, and outputting and displaying the sorting result on the terminal.

The steps 306-308 are similar to the steps 104-106 described above, and detailed description thereof is omitted here.

In this embodiment, the process of semantic role prediction for search sentences and text data is described in detail, and the accuracy of the prediction result is improved by forward and reverse part-of-speech analysis and forward and reverse semantic role analysis.

Referring to fig. 4, a fourth embodiment of the search method based on the search statement according to the embodiment of the present invention includes:

401. acquiring a search sentence input by a user and a text data set in a preset search resource pool, performing word segmentation on each text data in the search sentence and the text data set respectively, and performing vectorization coding on word segmentation results respectively to obtain at least one search keyword vector and a plurality of text word segmentation vectors respectively;

402. performing named entity recognition on at least one search keyword vector and a plurality of text participle vectors;

403. performing semantic role prediction on at least one search keyword vector and a plurality of text participle vectors;

404. calculating a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculating an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity recognition, and calculating a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

Wherein, steps 401-404 are similar to steps 101-104 described above, and detailed description thereof is omitted here.

405. When the number of the search keyword vectors is within a first preset range, taking the text similarity between the search statement and each piece of text data in the text data set as the global similarity;

It can be understood that the server determines the calculation mode of the global similarity by analyzing the number K of search keywords (participles) in the search statement. For example, in this embodiment, the first preset range is K ∈ [1,3] with K an integer. Since the search statement is short and contains few keywords, the semantic role similarity and the named entity similarity are not considered, and the text similarity between the search statement and the text data is used as the global similarity.

406. when the number of the search keyword vectors is within a second preset range, multiplying the entity similarity between the search statement and each text data in the text data set by the semantic role similarity to obtain the global similarity between the search statement and each text data in the text data set;

It is understood that, for example, in this embodiment the second preset range is K ∈ [3,5] with K an integer, and the server multiplies the named entity similarity between the search sentence and the text data by the semantic role similarity to obtain the global similarity.

407. When the number of the search keyword vectors is within a third preset range, multiplying the text similarity between the search statement and each piece of text data in the text data set by the semantic role similarity to obtain the global similarity between the search statement and each piece of text data in the text data set;

It is understood that, for example, in this embodiment the third preset range is K ∈ [5,7] with K an integer, and the server multiplies the text similarity between the search sentence and the text data by the semantic role similarity to obtain the global similarity.

408. When the number of the search keyword vectors is within a fourth preset range, multiplying the text similarity between the search statement and each piece of text data in the text data set by the semantic role similarity and the entity similarity in sequence to obtain the global similarity between the search statement and each piece of text data in the text data set;

It is understood that, for example, in this embodiment the fourth preset range is K ∈ (7, ∞) with K an integer. When the search sentence is long and includes many keywords, the server multiplies the text similarity, the named entity similarity and the semantic role similarity between the search sentence and the text data in sequence to obtain the global similarity.

409. And acquiring the webpage links corresponding to each piece of text data, sorting the webpage links in a descending order according to the global similarity, and outputting and displaying the sorting result on the terminal.

Step 409 is similar to the step 106, and is not described herein again.

In the embodiment, the calculation process of the global similarity is described in detail, and the calculation mode of the global similarity is flexibly adjusted by the number of the search keywords in the search sentence, so that the search efficiency is improved, and the calculation resources are reasonably distributed.
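
The keyword-count rule of steps 405-408 can be sketched as follows, using the example ranges given in this embodiment. Because the example ranges share the boundary values 3 and 5, this sketch assigns a boundary value to the lower range; that tie-break, like the function name, is an assumption and not something the text specifies.

```python
def global_similarity_by_count(k, text_sim, role_sim, entity_sim):
    """Pick the global similarity formula from the keyword count k.

    Ranges follow the embodiment's examples; shared boundaries (3 and 5)
    go to the lower range, which is an assumption of this sketch.
    """
    if k < 1:
        raise ValueError("a search statement needs at least one keyword")
    if k <= 3:   # first preset range: text similarity only
        return text_sim
    if k <= 5:   # second preset range: entity x semantic role
        return entity_sim * role_sim
    if k <= 7:   # third preset range: text x semantic role
        return text_sim * role_sim
    # fourth preset range: text x semantic role x entity
    return text_sim * role_sim * entity_sim
```

Short queries therefore skip the entity and semantic role components entirely, which is where the saving in computing resources described above comes from.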

The search method based on the search statement in the embodiment of the present invention has been described above. Referring to fig. 5, the search apparatus based on the search statement in the embodiment of the present invention is described below; an embodiment of the search apparatus based on the search statement includes:

a word vector generating module 501, configured to obtain a search statement input by a user and a text data set in a preset search resource pool, perform word segmentation on each piece of text data in the search statement and the text data set, and perform vectorization coding on a word segmentation result, so as to obtain at least one search keyword vector and multiple text word segmentation vectors;

a named entity recognition module 502, configured to perform named entity recognition on the at least one search keyword vector and the plurality of text participle vectors;

a semantic role prediction module 503, configured to perform semantic role prediction on the at least one search keyword vector and the text participle vectors;

a similarity component calculation module 504, configured to calculate a text similarity between the search sentence and each piece of text data in the text data set based on the at least one search keyword vector and the plurality of text segmentation vectors, calculate an entity similarity between the search sentence and each piece of text data in the text data set based on a result of named entity identification, and calculate a semantic role similarity between the search sentence and each piece of text data in the text data set based on a result of semantic role prediction;

a global similarity calculation module 505, configured to calculate at least one of a text similarity, a semantic role similarity, and an entity similarity between the search statement and each piece of text data in the text data set based on a preset calculation rule, so as to obtain a global similarity between the search statement and each piece of text data in the text data set;

and the visualization module 506 is configured to obtain the web page links corresponding to each piece of text data, sort the web page links in a descending order according to the magnitude of the global similarity, and output and display a sorting result at the terminal.

Referring to fig. 6, another embodiment of the search apparatus based on the search term according to the embodiment of the present invention includes:

The named entity identifying module 502 specifically includes:

a data set construction unit 5021, configured to obtain a preset initial training data set, and construct a data set to be identified based on the at least one search keyword vector and the text word segmentation vectors;

a supervised training unit 5022, configured to perform a first round of supervised training on a preset named entity recognition model with the initial training data set as a first round of training data set;

the identification and labeling unit 5023 is used for carrying out named entity identification and labeling on the data set to be identified based on the named entity identification model after the first round of supervised training to obtain a weakly labeled data set to be identified;

and the iterative training unit 5024 is used for extracting a subset from the weakly labeled data set to be recognized obtained in the current round, adding the subset into the initial training data set to obtain a second round of training data set, performing supervised training on the named entity recognition model after the first round of supervised training again based on the second round of training data set, and performing multiple rounds of training until the named entity recognition model converges, and outputting the results of entity recognition and labeling of the data set to be recognized in the current round.

Wherein the supervised training unit 5022 is configured to:

The semantic role prediction module 503 specifically includes:

a part-of-speech analysis unit 5031, configured to sequentially perform forward-order part-of-speech analysis and reverse-order part-of-speech analysis on a target vector based on a preset part-of-speech analysis model, and determine a part-of-speech type of a word corresponding to the target vector according to an analysis result, where the target vector includes the at least one search keyword vector and the plurality of text vectors;

a vector obtaining unit 5032, configured to search, according to the part-of-speech type of the participle corresponding to the target vector, a part-of-speech vector of the participle corresponding to the target vector in a preset part-of-speech vector library;

a role analysis unit 5033, configured to sequentially perform forward semantic role analysis and reverse semantic role analysis on the part-of-speech vectors of the participles corresponding to the target vector based on a preset role analysis model, and determine a semantic role type of the search keyword vector and a semantic role type of each text participle vector according to an analysis result.

The role analysis unit 5033 is specifically configured to:

The global similarity calculation module 505 specifically includes:

a first calculation unit 5051, configured to, when the number of the search keyword vectors is within a first preset range, take a text similarity between the search statement and each piece of text data in the text data set as a global similarity;

a second calculating unit 5052, configured to, when the number of the search keyword vectors is within a second preset range, multiply the entity similarity between the search statement and each piece of text data in the text data set by the semantic role similarity, to obtain a global similarity between the search statement and each piece of text data in the text data set;

a third calculation unit 5053, configured to multiply a text similarity between the search statement and each piece of text data in the text data set by a semantic role similarity when the number of the search keyword vectors is within a third preset range, so as to obtain a global similarity between the search statement and each piece of text data in the text data set;

a fourth calculating unit 5054, configured to, when the number of the search keyword vectors is within a fourth preset range, multiply the text similarity between the search statement and each piece of text data in the text data set by the semantic role similarity and the entity similarity in sequence to obtain a global similarity between the search statement and each piece of text data in the text data set.

In the embodiment of the invention, the modularized design ensures that hardware of each part of the searching device based on the search statement is concentrated on realizing a certain function, the performance of the hardware is realized to the maximum extent, and meanwhile, the modularized design also reduces the coupling between the modules of the device, thereby being more convenient to maintain.

Fig. 5 and fig. 6 describe the search apparatus based on the search statement in the embodiment of the present invention in detail from the perspective of modular functional entities; the search device based on the search statement in the embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 7 is a schematic structural diagram of a search device based on a search statement according to an embodiment of the present invention. The search device 700 based on a search statement may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing applications 733 or data 732. The memory 720 and the storage medium 730 may be transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the search device 700 based on the search statement. Further, the processor 710 may be configured to communicate with the storage medium 730 and to execute, on the search device 700 based on the search statement, the series of instruction operations in the storage medium 730.

The search apparatus 700 based on search statements may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input-output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so forth. It will be understood by those skilled in the art that the search sentence-based search apparatus configuration shown in fig. 7 does not constitute a limitation of the search sentence-based search apparatus, and may include more or less components than those shown, or combine some components, or a different arrangement of components.

The invention also provides a search apparatus based on a search statement, which comprises a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the search method based on the search statement in the above embodiments.

The present invention also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, and which may also be a volatile computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the steps of the search statement-based search method.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A search method based on a search statement is characterized in that the search method based on the search statement comprises the following steps:

2. The search sentence-based search method of claim 1, wherein the named entity identifying the at least one search keyword vector and the plurality of text participle vectors comprises:

3. The search sentence-based search method of claim 2, wherein the performing a first round of supervised training on a preset named entity recognition model with the initial training data set as a first round of training data set comprises:

4. The search sentence-based search method of claim 1, wherein the semantic role prediction of the at least one search keyword vector and the plurality of text participle vectors comprises:

5. The search method based on the search statement according to claim 4, wherein the sequentially performing forward semantic role analysis and reverse semantic role analysis on the part-of-speech vectors of the participles corresponding to the target vector based on the preset role analysis model, and determining the semantic role type of the search keyword vector and the semantic role type of each text participle vector according to the analysis result comprises:

6. The search method based on the search sentence according to any one of claims 1 to 5, wherein the calculating at least one of a text similarity, a semantic role similarity, and an entity similarity between the search sentence and each piece of text data in the text data set based on a preset calculation rule to obtain the global similarity between the search sentence and each piece of text data in the text data set comprises:

7. The search sentence-based search method according to claim 6, further comprising, after the multiplying the entity similarity and the semantic role similarity between the search sentence and each text data in the text data set when the number of the search keyword vectors is within a second preset range to obtain a global similarity between the search sentence and each text data in the text data set, the method further comprising:

8. A search sentence-based search apparatus, comprising:

9. A search sentence-based search apparatus characterized by comprising: a memory and at least one processor, the memory having instructions stored therein;

the at least one processor invokes the instructions in the memory to cause the search statement based search apparatus to perform the search statement based search method of any one of claims 1-7.

10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement a search statement-based search method according to any one of claims 1 to 7.