CN111737560A - Content search method, field prediction model training method, device and storage medium


Info

Publication number
CN111737560A
CN111737560A (application CN202010700846.1A)
Authority
CN
China
Prior art keywords
keyword
domain
vector
word
field
Prior art date
Legal status
Granted
Application number
CN202010700846.1A
Other languages
Chinese (zh)
Other versions
CN111737560B (en)
Inventor
邹若奇
Current Assignee
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202010700846.1A
Publication of CN111737560A
Application granted
Publication of CN111737560B
Legal status: Active

Classifications

    • G06F16/9532 Query formulation (querying, e.g. by the use of web search engines; retrieval from the web)
    • G06F16/35 Clustering; Classification (information retrieval of unstructured textual data)
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to artificial intelligence applied to smart cities, and uses natural language processing to improve search precision. Specifically disclosed are a content search method, a field prediction model training method, a device and a storage medium, wherein the search method comprises the following steps: acquiring a search text, and segmenting the search text to extract a plurality of keywords; embedding the keywords to obtain a keyword vector for each keyword; determining the part of speech of each keyword according to its keyword vector based on a trained part-of-speech tagging model; determining whether each keyword is a domain word according to its keyword vector based on a trained domain prediction model; determining a weight value for each keyword according to its part of speech and whether it is a domain word, wherein the weight value of a keyword that is a domain word is larger than that of a keyword that is not; and outputting a search result according to the keywords and their weight values based on a search engine. The application also relates to blockchain technology; the trained domain prediction model can be stored in a blockchain node.

Description

Content search method, field prediction model training method, device and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a content search method, a field prediction model training method, a device, and a storage medium.
Background
In the current information age, users rely on search tools to acquire information such as courses, knowledge, news, and hot topics; however, in some search engines, the search results returned to users differ greatly from the results users expect.
Natural Language Processing (NLP) is an important field of artificial intelligence. For humans, basic semantic understanding is an everyday language ability; for artificial intelligence, it represents progress at the highest level, and many research institutes take NLP as a technical focus. Research on improving search efficiency with natural language processing technology in the search engine field is ongoing. However, at present, the search results returned by most search technologies cannot be biased toward the field the user cares about most; search accuracy is low, and users must sift through a large number of search results to find the ones they want.
Disclosure of Invention
The application provides a content search method, a field prediction model training method, a device and a storage medium, which can bias the search results returned by a search engine toward the field the user emphasizes, so as to improve the accuracy of search.
In a first aspect, the present application provides a content search method, including:
acquiring a search text, and segmenting the search text to extract a plurality of keywords;
embedding the keywords to obtain a keyword vector of each keyword;
determining the part of speech of each keyword according to the keyword vector based on the trained part of speech tagging model;
determining whether each keyword is a domain word or not according to the keyword vector based on a trained domain prediction model;
determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a field word, wherein the weight value of the keyword which is the field word is larger than the weight value of the keyword which is not the field word;
and outputting a search result according to the keywords and the weight values of the keywords based on a search engine.
In a second aspect, the present application provides a method for training a domain prediction model, the method comprising:
acquiring training data, wherein the training data comprises content texts in popular fields and content texts in uncommon fields;
on the basis of a first domain prediction model, extracting a hidden state sequence from a content text of the hot domain, and reducing the dimension of the hidden state sequence to obtain a first scoring vector of each keyword in the content text, which is regarded as a domain word;
performing full connection processing, based on a first full connection layer, on the first scoring vector of each keyword in the content text that is regarded as a domain word, to obtain a first scoring value;
extracting a hidden state sequence from the content text of the uncommon field based on a second field prediction model, and reducing the dimensions of the hidden state sequence to obtain a second scoring vector of each keyword in the content text regarded as a field word;
based on a second full-connection layer, performing full-connection processing on a second scoring vector of each keyword in the content text, which is regarded as a field word, to obtain a second scoring value;
calculating a scoring loss value according to the first scoring value and the second scoring value, calculating a maximum mean difference loss according to the first scoring vector and the second scoring vector, and calculating a joint loss value according to the scoring loss value and the maximum mean difference loss;
and adjusting parameters of the second domain prediction model according to the joint loss value, and determining the adjusted second domain prediction model as the domain prediction model.
In a third aspect, the present application provides a content search apparatus, the apparatus comprising:
the word segmentation module is used for acquiring a search text and segmenting the search text to extract a plurality of keywords;
the embedding module is used for embedding the plurality of keywords to obtain a keyword vector of each keyword;
the part-of-speech determining module is used for determining the part-of-speech of each keyword according to the keyword vector based on the trained part-of-speech tagging model;
the domain determining module is used for determining whether each keyword is a domain word or not according to the keyword vector based on a trained domain prediction model;
the weight determining module is used for determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a field word, wherein the weight value of the keyword which is the field word is larger than that of the keyword which is not the field word;
and the searching module is used for outputting a searching result according to the keyword and the weight value of the keyword based on a searching engine.
In a fourth aspect, the present application provides an apparatus for training a domain prediction model, the apparatus comprising:
the data acquisition module is used for acquiring training data, wherein the training data comprises content texts in popular fields and content texts in uncommon fields;
the first vector processing module is used for extracting a hidden state sequence from the content text of the popular domain and reducing the dimension of the hidden state sequence to obtain a first scoring vector of each keyword in the content text regarded as a domain word based on a first domain prediction model;
the first connection module is used for performing full connection processing, based on a first full connection layer, on the first scoring vector of each keyword in the content text regarded as a domain word, to obtain a first scoring value;
the second vector processing module is used for extracting, based on a second field prediction model, a hidden state sequence from the content text of the uncommon field and reducing the dimension of the hidden state sequence to obtain a second scoring vector of each keyword in the content text regarded as a field word;
the second connection module is used for carrying out full connection processing on a second scoring vector of each keyword in the content text regarded as a field word based on a second full connection layer to obtain a second scoring value;
a loss determination module, configured to calculate a scoring loss value according to the first scoring value and the second scoring value, calculate a maximum mean difference loss according to the first scoring vector and the second scoring vector, and calculate a joint loss value according to the scoring loss and the maximum mean difference loss;
and the parameter adjusting module is used for adjusting the parameters of the second domain prediction model according to the joint loss value and determining the adjusted second domain prediction model as the domain prediction model.
In a fifth aspect, the present application provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the content search method and/or the domain prediction model training method when executing the computer program.
In a sixth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the content search method and/or the domain prediction model training method described above.
The application discloses a content search method, a field prediction model training method, a device and a storage medium. The method segments the acquired search text to extract keywords, embeds the keywords, tags the part of speech of each keyword through a part-of-speech tagging model, and judges whether each keyword is a field word through the field prediction model; the weight of each keyword in the search text is then determined according to its part-of-speech tag and whether it is a field word, so that the search results returned by the search engine are closer to the user's requirements, and the accuracy of searching the search text is improved based on artificial intelligence, especially natural language processing.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a content search method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a domain prediction model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training domain prediction model according to an embodiment of the present application;
FIG. 4 is a schematic flow chart diagram illustrating a domain prediction model training method according to another embodiment of the present application;
FIG. 5 is a flowchart illustrating a part-of-speech prediction model training method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a content search apparatus according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a domain predictive model training apparatus according to an embodiment of the present application;
fig. 8 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation. In addition, although the division of the functional blocks is made in the device diagram, in some cases, it may be divided in blocks different from those in the device diagram.
The embodiment of the application provides a content search method, a model training method, a device, computer equipment and a computer readable storage medium. The method is used for improving the searching accuracy based on artificial intelligence. For example, in a smart city, it is often necessary to search required data from a large amount of data according to a search text, for example, as the smart city develops, a large number of resources such as curriculum courses and live broadcasts are accumulated, and the required resources can be obtained from these resources according to the content search method of the embodiment of the present application.
The content search method and the model training method can be used for a server, and can also be used for a terminal, wherein the terminal can be an electronic device such as a mobile phone, a tablet computer, a notebook computer and a desktop computer; the servers may be, for example, individual servers or clusters of servers. However, for the sake of understanding, the following embodiments will be described in detail with reference to a content search method and a model training method applied to a server.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flow chart of a content search method according to an embodiment of the present application.
As shown in fig. 1, the content search method may include the following steps S110 to S160.
Step S110, a search text is obtained, and word segmentation is carried out on the search text to extract a plurality of keywords.
In some embodiments, the search text may be obtained from a terminal of the user.
Illustratively, a user enters search text in a search box displayed by a user terminal; the server running the content search method acquires the search text transmitted from the user terminal, and the acquired search text is, for example, "the principle of pension guarantee gap calculation for premium sales".
Illustratively, the search text may be participled based on a participle model. The segmentation model can be obtained by training the neural network model according to the labeled segmentation data, and the parameters of the neural network model can be obtained by learning and adjusting from the labeled segmentation data based on an algorithm framework of online machine learning.
For example, the tagged participle data may include participle data of a common corpus, such as open-source corpus participle data, and/or a business corpus, such as business corpus participle data stored on a server running the content search method.
Illustratively, the search text may be participled based on a participle model and sequence tagging of the words. For the word sequence of the input search text, the word segmentation model can label each word in the search text with a mark for identifying a word boundary, and a plurality of keywords in the search text can be determined according to the mark for identifying the word boundary.
The search text may also be tokenized, illustratively, based on the tokenization model and the labeled tokenization data. For the input search text, the word segmentation model can compare the search text with the labeled word segmentation data, and according to the comparison result, the same or similar word groups are determined as a plurality of keywords in the search text.
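A minimal sketch of this segmentation step, assuming the open-source jieba tokenizer as a stand-in for the trained word segmentation model; the stopword list and example text are illustrative:

```python
# Keyword extraction by word segmentation (sketch).
# jieba stands in for the patent's trained segmentation model.
import jieba

STOPWORDS = {"的", "了", "和"}  # hypothetical stopword list

def extract_keywords(search_text: str) -> list[str]:
    """Segment the search text and keep non-stopword tokens as keywords."""
    tokens = jieba.lcut(search_text)
    return [t for t in tokens if t.strip() and t not in STOPWORDS]

# Illustrative search text: "the principle of pension guarantee gap
# calculation for premium sales"
print(extract_keywords("保费销售的养老金保障缺口计算原理"))
```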
And step S120, embedding the plurality of keywords to obtain a keyword vector of each keyword.
Since a deep learning model receives numbers rather than character strings as input, after the search text is acquired and several keywords are extracted from it, the keywords need to be converted into word vectors. Common word-vector representation methods are Word2Vec and GloVe.
In some embodiments, a Word2Vec model can be used to embed the several keywords obtained by word segmentation to obtain the keyword vectors, realizing a vectorized encoding of the search text that serves as input data for the subsequent models. Word2Vec is a group of widely used related models for generating word vectors; by learning from word texts, it represents the semantic information of words in vector form, and similar or related phrases lie very close to each other in the Word2Vec vector space.
Illustratively, a Word2Vec model (e.g., a CBOW model) may be trained using the gensim library, and optionally, professional-domain corpora of various industries may be used as the corpus for training the Word2Vec model.
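A short sketch of that training setup with the gensim library; the toy corpus and hyperparameters are illustrative:

```python
# Training a CBOW Word2Vec model with gensim (sketch).
from gensim.models import Word2Vec

# Each sentence is a list of segmented keywords; a real corpus would be
# professional-domain text from various industries.
corpus = [
    ["premium", "sales", "pension", "guarantee", "gap"],
    ["course", "live", "broadcast", "resource"],
]

# sg=0 selects the CBOW architecture; vector_size is the embedding dimension d.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0)
vec = model.wv["premium"]  # keyword vector for one keyword
```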
In other embodiments, the search text is divided in sentence units, and the keywords extracted from the sentences are embedded to obtain a keyword vector of each keyword.
Illustratively, a sentence (word sequence) containing n keywords is written as $x = (x_1, x_2, \ldots, x_i, \ldots, x_n)$, where $x_i$ represents the id in the dictionary of the i-th keyword in the sentence; the one-hot vector corresponding to each keyword in the search text can thus be obtained, its dimension being the size of the dictionary.
Illustratively, the one-hot vector corresponding to each keyword may be processed by a dimension-reduction layer, such as a look-up layer, to obtain the keyword vector. For example, a pre-trained or randomly initialized word embedding matrix can be used to map each keyword $x_i$ in a sentence from its one-hot vector to a low-dimensional dense keyword vector (character embedding), $x_i \in \mathbb{R}^d$, where d is the dimension of the word embedding matrix. The dimension reduction lowers the data volume and roughly groups each keyword, so that the subsequent part-of-speech classification of the keywords becomes more accurate.
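A minimal sketch of the look-up layer with a trainable embedding matrix; the vocabulary size and dimension d are illustrative:

```python
# Mapping dictionary ids to dense keyword vectors (sketch).
import torch
import torch.nn as nn

vocab_size, d = 10000, 128
embedding = nn.Embedding(vocab_size, d)  # word embedding matrix, |V| x d

ids = torch.tensor([[3, 17, 42]])        # ids of 3 keywords in one sentence
keyword_vectors = embedding(ids)         # shape (1, 3, d): each x_i -> R^d
```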
And S130, determining the part of speech of each keyword according to the keyword vector based on the trained part of speech tagging model.
For the obtained keyword vector of each keyword, a trained part-of-speech tagging model, for example a trained Bilstm+CRF network model, may be used to determine the part of speech of each keyword.
Illustratively, the part-of-speech tagging model includes a Bilstm layer, a linear layer, and a CRF layer. The Bilstm layer is a bidirectional long short-term memory network comprising a forward LSTM layer and a backward LSTM layer, and the CRF layer is a conditional random field network.
In some embodiments, determining the part-of-speech of each of the keywords according to the keyword vector based on the trained part-of-speech tagging model includes steps S131 to S134.
Step S131, based on the Bilstm layer, determining a forward hidden state vector from the forward direction of the plurality of keyword vectors, and determining a backward hidden state vector from the backward direction of the plurality of keyword vectors.
For example, since the keywords in a search text often have contextual relationships with one another, and keywords appearing later may affect the meaning of earlier keywords, the Bilstm layer processes the keyword vectors from both directions, before and after each word, for better processing.
For example, the sequence $x = (x_1, x_2, \ldots, x_n)$ composed of keyword vectors is used as the input of each time step of the forward LSTM layer and the backward LSTM layer. The forward LSTM layer processes the keyword vectors in the forward direction and outputs the forward hidden state vectors $\overrightarrow{h_i}$; the backward LSTM layer processes them in the backward direction and outputs the backward hidden state vectors $\overleftarrow{h_i}$. Processing from both directions avoids relying only on the preceding vectors when handling the keyword vector sequence, and improves the accuracy of part-of-speech recognition of the keywords.
And S132, fusing and dimensionality reduction are carried out on the forward hidden state vector and the backward hidden state vector based on a linear layer, so that classification scores of a plurality of classification labels corresponding to the keywords are obtained.
Illustratively, the linear layer splices or adds the forward hidden state vector $\overrightarrow{h_i}$ and the backward hidden state vector $\overleftarrow{h_i}$ output by the LSTM to obtain a complete hidden state sequence $h = (h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$. Dimension reduction is then performed by mapping the m dimensions to k dimensions, where k can be the number of part-of-speech categories in the classification label set. The reduced sequence is recorded as a matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, and each component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is regarded as the classification score for classifying the keyword $x_i$ into the j-th classification label.
In this case, performing a classification mapping (softmax) directly on the matrix P obtained in the previous step would amount to an independent k-class classification of each keyword, without considering the influence of the sentence on part of speech; to take that influence into account, a CRF layer is attached to perform sentence-level sequence tagging.
Step S133, based on the CRF layer, determining transfer scores of a plurality of classification labels corresponding to each keyword according to a transfer matrix between different classification labels and classification scores of the plurality of classification labels corresponding to each keyword.
Exemplarily, the CRF layer processes the matrix P together with its transfer matrix A, where the entry $A_{ij}$ represents the transfer score for transferring from the i-th classification label to the j-th classification label, so as to obtain the transfer score values of each keyword vector with respect to the classification labels.
Illustratively, the transfer matrix A is a $(k+2) \times (k+2)$ matrix; the size is k+2 because a start state is added at the head of the sentence and an end state at the tail. The transfer matrix A in the CRF layer allows the mutual influence between the parts of speech of the keywords in a sentence to be taken into account, yielding a more accurate part-of-speech classification result.
Step S134, determining the part of speech of each keyword according to the classification score value and the transfer score value of each keyword corresponding to a plurality of classification labels.
Illustratively, the scores $p_i$ output by the Bilstm layer and $A_i$ output by the CRF layer are added, softmax is used to obtain the probability of each label for each keyword, and the part of speech of each keyword is obtained by taking the label with the maximum probability.
For example, denoting a label sequence of length n as $y = (y_1, y_2, \ldots, y_n)$, the trained part-of-speech tagging model scores the assignment of the labels y to the keyword sequence x as:

$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$$

The normalized probability using softmax is:

$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$

After normalization, the probability of each label sequence y for the keyword sequence x is obtained, and the part of speech of each keyword is determined by taking the labels with the maximum probability.
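As a concrete illustration, the following minimal PyTorch sketch computes the emission matrix P and the sequence score described above; it is a simplified reading of the Bilstm+CRF tagger (full CRF training and Viterbi decoding are omitted, and all sizes are illustrative):

```python
# Emission scores and sequence scoring for a BiLSTM+CRF tagger (sketch).
import torch
import torch.nn as nn

d, hidden, k = 128, 64, 10         # embedding dim, LSTM dim, number of labels

bilstm = nn.LSTM(d, hidden, bidirectional=True, batch_first=True)
linear = nn.Linear(2 * hidden, k)  # fuse and reduce m = 2*hidden dims to k
A = torch.randn(k, k)              # transition matrix (start/end states omitted)

def emission_scores(keyword_vectors: torch.Tensor) -> torch.Tensor:
    """keyword_vectors: (1, n, d) -> emission matrix P of shape (n, k)."""
    h, _ = bilstm(keyword_vectors)  # (1, n, 2*hidden): forward || backward
    return linear(h).squeeze(0)

def sequence_score(P: torch.Tensor, y: list[int]) -> torch.Tensor:
    """score(x, y) = sum_i P[i, y_i] + sum_i A[y_i, y_{i+1}]."""
    s = sum(P[i, y[i]] for i in range(len(y)))
    s = s + sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return s
```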
And step S140, determining whether each keyword is a domain word or not according to the keyword vector based on the trained domain prediction model.
The search text usually contains information about the field of the data the user needs; for example, the search text "the principle of pension guarantee gap calculation for premium sales" indicates that the user prefers the search engine to return content from the "premium sales" field. By identifying the field words among the keywords of the search text, the matching degree between the search content and the search requirement can be improved, improving search precision.
In some embodiments, the trained domain prediction model comprises: a Bilstm layer and a linear layer; and the step of determining whether each keyword is a domain word or not according to the keyword vector based on the trained domain prediction model comprises the steps of S141-S143.
Step S141, based on the Bilstm layer, determining a first hidden state vector from the forward direction of the plurality of keyword vectors, and determining a second hidden state vector from the backward direction of the plurality of keyword vectors.
Illustratively, the sequence $(x_1, x_2, \ldots, x_n)$ composed of keyword vectors is used as the input of each time step of the forward LSTM layer and the backward LSTM layer; the forward LSTM layer processes the keyword vectors in the forward direction and outputs the first hidden state vector $\overrightarrow{h_i}$, and the backward LSTM layer processes them in the backward direction and outputs the second hidden state vector $\overleftarrow{h_i}$.
And S142, fusing and dimensionality reduction are carried out on the first hidden state vector and the second hidden state vector based on the linear layer, and a domain scoring value of each keyword regarded as a domain word is determined.
Illustratively, the linear layer splices or adds the first hidden state vector $\overrightarrow{h_i}$ and the second hidden state vector $\overleftarrow{h_i}$ output by the LSTM to obtain a first complete hidden state sequence $h = (h_1, h_2, \ldots, h_n) \in \mathbb{R}^{n \times m}$. Dimension reduction then maps the m dimensions to k dimensions, where k can be the number of different domain labels. The reduced sequence is recorded as a matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, and each component $p_{ij}$ of $p_i \in \mathbb{R}^k$ is regarded as the domain score value of keyword $x_i$ for the j-th domain label, giving the domain score value of each keyword regarded as a domain word.
Step S143, determining whether each keyword is a domain word according to the domain score value of each keyword regarded as the domain word.
For example, several keywords with the highest scoring values may be determined as domain words, or several keywords with scoring values not less than the scoring threshold may be determined as domain words.
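A small sketch of both selection strategies; the threshold and top-j values are illustrative:

```python
# Turning per-keyword domain score values into domain-word decisions (sketch).
def pick_domain_words(keywords, scores, top_j=1, threshold=None):
    """Select by score threshold if given, else take the top-j scorers."""
    if threshold is not None:
        return [w for w, s in zip(keywords, scores) if s >= threshold]
    ranked = sorted(zip(keywords, scores), key=lambda p: p[1], reverse=True)
    return [w for w, _ in ranked[:top_j]]

print(pick_domain_words(["premium sales", "gap"], [0.9, 0.3], top_j=1))
```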
Step S150, determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a field word, wherein the weight value of the keyword which is the field word is larger than that of the keyword which is not the field word.
By giving higher weight to the field words in the search text, the matching degree of the search content and the search requirement can be improved, and the search precision is improved.
In some embodiments, the weight values of the keywords may be determined according to the following rules:
a. The weight of a domain word is higher than that of a common keyword, i.e., the weight value of a keyword that is a domain word is higher than that of an ordinary keyword.
b. The domain words are weight-ranked according to their parts of speech; for example, among keywords that are all domain words, nouns carry a higher weight than verbs.
c. The common keywords are weight-ranked according to their parts of speech; for example, common keywords that are nouns are weighted higher than common keywords that are verbs.
d. The weights of the domain words and of the keywords of different parts of speech are combined and normalized.
For example, the weight value of the keyword may be calculated according to the following equations (reconstructed from the worked example below):

If the i-th keyword is not a domain word, the weight value weight_i of the i-th keyword is:

$$\mathrm{weight}_i = \frac{POS_i}{2\sum_{j=1}^{n} POS_j}$$

If the i-th keyword is a domain word, the weight value is:

$$\mathrm{weight}_i = \frac{POS_i}{\sum_{j=1}^{n} POS_j} + \frac{1}{2m}$$

wherein n represents the number of keywords and is equal to the sum of the number of domain words and the number of non-domain words; $POS_i$ represents the weight corresponding to the part of speech of the i-th keyword, obtained from a preset part-of-speech weight ordering or a default part-of-speech weight ordering table; and m represents the number of domain words.
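The sketch below implements this weighting scheme; the part-of-speech weights (noun = 6, verb = 5, particle = 1) are inferred from the worked example that follows and are not stated in the source:

```python
# Keyword weighting by part of speech and domain-word status (sketch).
POS_WEIGHT = {"noun": 6, "verb": 5, "particle": 1}  # inferred, hypothetical

def keyword_weights(keywords):
    """keywords: list of (word, part_of_speech, is_domain_word) tuples."""
    s = sum(POS_WEIGHT[pos] for _, pos, _ in keywords)  # sum of POS weights
    m = sum(1 for _, _, dom in keywords if dom)         # number of domain words
    weights = {}
    for word, pos, dom in keywords:
        if dom:  # domain words receive the extra 1/(2m) boost (rule a)
            weights[word] = POS_WEIGHT[pos] / s + 1.0 / (2 * m)
        else:
            weights[word] = POS_WEIGHT[pos] / (2 * s)
    return weights

# Reproduces the example below: "premium sales" -> 6/46 + 1/2 ≈ 0.63
```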
For example, for the search text "the principle of pension guarantee gap calculation for premium sales", the keywords determined after word segmentation and their weight values are:

Common keywords and their corresponding weight values:

premium 0.0652
sales 0.05435
of 0.01087
pension 0.0652
guarantee 0.05435
gap 0.0652
calculation 0.05435
principle 0.0652

Domain word and its corresponding weight value:

premium sales 0.63048
It can be seen that the weight value of the field word "premium sales" is the largest, so the search results returned by the search engine are biased toward "premium sales", which better matches the user's search requirement and improves the precision of the search results.
And step S160, outputting a search result according to the keywords and the weight values of the keywords based on a search engine.
In some embodiments, the keywords and their weight values are returned to the Solr search engine or another search engine, which can configure a different query for each keyword and determine the query depth according to the weight values; search results can thus be output according to the keywords and their weight values, and the output results are better biased toward the field of the user's search requirement.
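As an illustration, the per-keyword weights can be passed to Solr as term boosts using the standard Lucene "^" syntax; the pysolr client, the core URL, and the field name "content" are assumptions of the sketch:

```python
# Sending weighted keywords to Solr as boosted query terms (sketch).
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/courses")  # hypothetical core

def boosted_query(weights: dict[str, float], field: str = "content") -> str:
    """Build e.g. content:"premium sales"^0.63048 OR content:"premium"^0.06520"""
    return " OR ".join(f'{field}:"{kw}"^{w:.5f}' for kw, w in weights.items())

results = solr.search(boosted_query({"premium sales": 0.63048,
                                     "premium": 0.0652}))
```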
For example, the search engine may also assign different search weights to different index types; for instance, a title may carry a greater weight than a course introduction.
Illustratively, the search engine may also combine keywords with Boolean operators (AND, OR, NOT) according to business needs, so that the several keywords the user cares about are highlighted with higher weight while all of the user's intentions are still covered and no recall results are lost.
For example, the search engine may output search results based on combinations of keywords connected by AND/OR, such as the user input ("premium sales" AND "pension" OR "premium" OR "sales"); keywords combined with AND may receive higher weights, while keywords combined with OR may receive lower weights, ensuring that the returned search results are biased toward the user's needs while results participating in the ranking are not lost.
Illustratively, the search engine returns a desired search result according to its own text similarity algorithm, and the server may transmit the search result to the user terminal.
In some embodiments, the method further comprises: determining whether each keyword is a key named entity.
Exemplarily, after the step S140, the method further includes: determining whether each keyword is a key named entity.
Specifically, the step S150 determines the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a domain word, and includes:
determining the weight value of each keyword according to the part of speech of each keyword, whether each keyword is a field word and whether each keyword is a key named entity; wherein the weight value of the keyword which is the key named entity is larger than the weight value of the keyword which is not the key named entity and is not the domain word.
Named entities are, for example, area names, person names, organization names, and all other entities identified by a name. Broader entities also include numbers, dates, currencies, addresses, and the like, as well as business nouns, product nouns, and everyday nouns.
Illustratively, a service-specific named entity table is stored in advance, and key named entities can be screened from the keywords according to this table.
For example, after the search text is acquired and segmented to extract keywords, the part of speech of each keyword is determined and whether each keyword is a domain word is judged; each keyword is then compared with the service-specific named entity table, and a keyword that matches an entry in the table is a key named entity.
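A minimal sketch of this screening step; the table contents are illustrative:

```python
# Screening key named entities against a pre-stored service table (sketch).
SERVICE_ENTITY_TABLE = {"premium sales", "pension", "Ping An"}  # hypothetical

def key_named_entities(keywords: list[str]) -> set[str]:
    """A keyword equal to an entry in the table is a key named entity."""
    return {kw for kw in keywords if kw in SERVICE_ENTITY_TABLE}
```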
By determining the weight value of the keyword according to whether the keyword is the key named entity, the weight value of the keyword can be more accurate, noise in a search result is reduced, particularly the search result can be more biased to the key named entity in a search text, and the search accuracy is higher.
Referring to fig. 2, fig. 2 is a schematic flowchart of a training method of a domain prediction model according to an embodiment of the present application. The training method of the domain prediction model can be used for training the domain prediction model. The trained domain prediction model can be deployed at a terminal or a server, so that whether the keywords are domain words or not can be determined according to the keyword vectors when the content search method is executed, and the accuracy of content search is improved.
In some embodiments, the trained domain prediction model may be stored in a blockchain node. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Referring to fig. 2, the method for training the domain prediction model specifically includes steps S210 to S270.
Step S210, training data are obtained, wherein the training data comprise content texts in popular fields and content texts in uncommon fields.
The content texts in the training data may be unevenly distributed across fields: for example, nearly a million courses provide a large amount of training text, while some uncommon fields still have insufficient data, or even no data at all, especially for newly opened classes. Therefore, the uneven field distribution of the training content text is addressed by acquiring content text of popular fields and mining content text of uncommon fields, and the model's prediction on fields with insufficient data is improved by learning from data of related fields.
In some embodiments, the method further comprises: the method comprises the following steps of carrying out word segmentation and embedding processing on a content text in the popular field to obtain a plurality of keyword vectors of the content text in the popular field, and carrying out word segmentation and embedding processing on the content text in the rarely-used field to obtain a plurality of keyword vectors of the content text in the rarely-used field.
Illustratively, the content text of the popular field and the content text of the uncommon field can each be segmented based on the word segmentation model to extract keywords; the keywords obtained by segmentation can then be embedded based on the Word2Vec model to obtain keyword vectors, realizing the vectorized encoding of the text used as input data for the subsequent models.
Referring to fig. 3, fig. 3 is a schematic diagram of training a domain prediction model according to the content text of the popular domain and the content text of the uncommon domain.
Step S220, based on a first domain prediction model, extracting a hidden state sequence from the content text of the hot domain, and reducing the dimension of the hidden state sequence to obtain a first scoring vector of each keyword in the content text regarded as a domain word.
Illustratively, based on a first domain prediction model, the keyword features in the sentence are directly extracted from the content text of the hot domain by taking the sentence as a unit, the features are combined to generate a hidden state sequence corresponding to the hot domain, and a first scoring vector is calculated by reducing the dimension of the obtained hidden state sequence.
For example, based on the first domain prediction model, a hidden state sequence corresponding to the hot domain may be generated from the obtained several keyword vectors of the content text of the hot domain, and the obtained hidden state sequence may be subjected to dimensionality reduction to calculate a first scoring vector.
In some embodiments, the first domain prediction model comprises a Bilstm layer and a linear layer (the LSTM's output layer). Step S220 includes steps S221 to S222.
Step S221, determining a first hidden state vector from the forward direction of the content text of the popular domain and a second hidden state vector from the backward direction, based on the Bilstm layer of the first domain prediction model.
In some embodiments, the keyword features in the sentences of the content text of the popular domain are extracted starting from the forward direction and combined to generate the first hidden state vector, while the keyword features are extracted starting from the backward direction and combined to generate the second hidden state vector.
In other embodiments, the keyword vectors are processed from the forward direction and combined to generate the first hidden state vector, while the keyword vectors are processed from the backward direction and combined to generate the second hidden state vector.
Step S222, based on the linear layer of the first domain prediction model, fusing and dimensionality reduction are performed on the first hidden state vector and the second hidden state vector, so as to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word.
Illustratively, a first complete hidden state sequence is obtained by splicing or adding the first hidden state vector and the second hidden state vector; dimension reduction then maps the sequence from m dimensions to k dimensions, where k can be the number of different domain labels. The reduced sequence is recorded as a matrix $P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$, and each component $p_{ij}$ of $p_i \in \mathbb{R}^k$ can be regarded as the domain scoring vector relating keyword $x_i$ to the j-th domain label.
Step S230, performing full connection processing, based on a first full connection layer (dense layer), on the first scoring vector of each keyword in the content text regarded as a domain word, to obtain a first scoring value.
Exemplarily, the obtained domain scoring vectors of the keywords regarded as domain words enter the first full connection layer for matrix-vector product calculation, obtaining the first scoring value. The first scoring value represents the score corresponding to the keywords regarded as domain words in the content text of the popular domain.
Step S240, based on a second domain prediction model, extracting a hidden state sequence from the content text of the uncommon domain, and reducing the dimensions of the hidden state sequence to obtain a second scoring vector of each keyword in the content text regarded as a domain word.
Illustratively, the second domain prediction model also includes a Bilstm layer and a linear layer. Extracting a hidden state sequence from the content text of the uncommon field and reducing its dimensions to obtain the second scoring vector of each keyword in the content text regarded as a field word, based on the second field prediction model, comprises: determining a third hidden state vector from the forward direction of the content text of the uncommon field and a fourth hidden state vector from the backward direction, based on the Bilstm layer of the second field prediction model; and fusing and reducing the dimensions of the third hidden state vector and the fourth hidden state vector, based on the linear layer of the second field prediction model, to obtain the second scoring vector in which each keyword in the content text is regarded as a field word.
Referring to step S220, a second scoring vector, which is regarded as a domain word, is obtained for each keyword in the content text in the uncommon domain. For example, based on a second domain prediction model, the keyword features in the sentences are directly extracted from the content text of the uncommon domain by taking the sentences as units, the features are combined to generate a hidden state sequence corresponding to the uncommon domain, and the second scoring vector is calculated by reducing the dimension of the hidden state sequence.
And S250, carrying out full connection processing on second scoring vectors of the keywords regarded as the field words in the content text based on a second full connection layer to obtain second scoring values.
Referring to step S230, a second score value is obtained after performing full join processing on the second score vector and the second full join layer. And the second score value represents the score corresponding to the key words regarded as the domain words in the content text of the uncommon domain.
Step S260, calculating a maximum mean difference (MMD) loss according to the first scoring vector and the second scoring vector, calculating a scoring loss value according to the first scoring value and the second scoring value, and calculating a joint loss value according to the scoring loss value and the maximum mean difference loss.
In some embodiments, a first scoring value output by a first full-link layer and a second scoring value output by a second full-link layer are calculated to obtain a scoring loss value, and a first scoring vector in the first full-link layer and a second scoring vector in the second full-link layer are vector-calculated to obtain a maximum mean difference loss. And the combined loss value is calculated by the scoring loss value and the maximum mean difference loss, for example, by averaging or performing a weighted summation.
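A minimal sketch of such a joint loss, assuming mean squared error for the scoring loss and a linear-kernel MMD between the two batches of scoring vectors; both concrete choices are assumptions, since the text only names the loss components:

```python
# Joint loss = scoring loss + weighted maximum mean discrepancy (sketch).
import torch
import torch.nn.functional as F

def mmd_loss(v1: torch.Tensor, v2: torch.Tensor) -> torch.Tensor:
    """Linear MMD: squared distance between the batch means of v1 and v2."""
    return (v1.mean(dim=0) - v2.mean(dim=0)).pow(2).sum()

def joint_loss(score1, score2, vec1, vec2, alpha=0.5):
    """alpha is a hypothetical weighting coefficient for the MMD term."""
    scoring_loss = F.mse_loss(score1, score2)
    return scoring_loss + alpha * mmd_loss(vec1, vec2)
```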
And step S270, adjusting parameters of the second domain prediction model according to the joint loss value, and determining the adjusted second domain prediction model as the domain prediction model.
Exemplarily, the joint loss value better reflects the deviation of the second field prediction model when predicting field words in content text of the uncommon field; by adjusting the parameters of the second field prediction model according to the joint loss value, the model can more accurately identify field words in content text of the uncommon field, improving the prediction accuracy of the field prediction model on field words in search text.
In other embodiments, referring to fig. 4, the method for training the domain prediction model specifically includes steps S310 to S340.
Step S310, training data are obtained, and the training data comprise content texts and field word labels corresponding to the content texts.
The field word tags can be determined manually in the content titles corresponding to the content texts, for example, keywords are extracted from the content titles, and then field words are obtained by manual screening from the extracted keywords and labeled with tags.
Illustratively, the keywords extracted from the content titles may be updated periodically, and the update threshold may be a time limit, a data limit, a predicted performance index, and the like. The updating of the keywords can enable the domain word labels in the training data to be changed according to the search text, such as the actual course update.
Step S320, determining a domain scoring value regarded as a domain word in the content text in the training data based on a domain prediction model.
Referring to steps S141 to S142, a domain score value is obtained by using the domain prediction model, where each word in the content text in the training data is regarded as a domain word.
And S330, calculating a loss value according to the field scoring value and the field word label corresponding to the content text.
Illustratively, a plurality of domain words in the content text can be determined according to the domain scoring values, and then the domain words are compared with the domain word labels corresponding to the corresponding content text to obtain the loss value.
And step S340, adjusting parameters of the field prediction model according to the loss value.
Illustratively, the parameters of the domain prediction model are adjusted according to the obtained loss value, so that the domain prediction model can be more biased to the domain corresponding to the expected search result of the user when performing domain prediction, and the search precision is improved.
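A minimal sketch of this supervised variant, treating the per-keyword domain decision as binary classification against the manually determined labels (the choice of cross-entropy is an assumption):

```python
# Loss between domain score values and domain-word labels (sketch).
import torch
import torch.nn.functional as F

def domain_label_loss(domain_scores: torch.Tensor,
                      labels: torch.Tensor) -> torch.Tensor:
    """domain_scores: (n, 2) per-keyword scores; labels: (n,) 0/1 tags."""
    return F.cross_entropy(domain_scores, labels)
```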
In other embodiments, referring to fig. 5, fig. 5 is a flowchart illustrating a method for training a part-of-speech tagging model, where the method for training the part-of-speech tagging model specifically includes steps S410 to S440.
Step S410, training data are obtained, and the training data comprise search logs and resource text data.
The search log can be the text entered by the user in each search; the resource text data can be, for example, the title, content introduction, and text content of course resources in an application, or the live room name, author, title, content introduction, and other text content of live video courses.
Illustratively, training data for supervised learning needs to be labeled manually. For example, part-of-speech tagging is performed on the training data, and a Bilstm+CRF recurrent neural network (RNN) model can be trained on the labeled data, improving the model's part-of-speech prediction accuracy on search-text keywords.
Illustratively, manual annotation may be performed with the BIO annotation method, using the BIO annotation set employed in the Bakeoff-3 evaluation: B-PER and I-PER mark the beginning and interior of a person name, B-LOC and I-LOC the beginning and interior of a place name, B-ORG and I-ORG the beginning and interior of an organization name, and O marks a word that does not belong to any entity. The Bilstm+CRF RNN model is trained to predict the part of speech of text using the training data with part-of-speech labels.
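For illustration, a short sentence tagged under this scheme (tokens and tags invented for the example):

```python
# BIO-tagged training sample (sketch).
tagged = [
    ("Zhang", "B-PER"), ("San", "I-PER"),  # person name
    ("works", "O"), ("at", "O"),
    ("Ping", "B-ORG"), ("An", "I-ORG"),    # organization name
    ("in", "O"),
    ("Shenzhen", "B-LOC"),                 # place name
]
```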
Step S420, segmenting words from the training data to extract a plurality of keywords, and embedding the keywords to obtain keyword vectors of the keywords.
Referring to step S120, a keyword vector of each keyword is obtained by performing an embedding process on the training data.
And step S430, determining a corresponding part-of-speech score value of each keyword according to the vector of each keyword based on an RNN model of Bilstm + CRF.
Referring to steps S131 to S133, a corresponding part-of-speech score value of each keyword is obtained according to the vector of each keyword.
Step S440, adjusting RNN network model parameters of the Bilstm + CRF according to the corresponding part-of-speech score value of each keyword, and determining the adjusted RNN network model of the Bilstm + CRF as the trained part-of-speech tagging model.
Illustratively, the model parameters may be fitted by maximizing the log-likelihood of the classification scores of the correct labels; after the parameters are calculated, the Bilstm+CRF RNN model parameters are adjusted to obtain the trained part-of-speech tagging model.
Referring to fig. 6, fig. 6 is a schematic diagram of a content search apparatus according to an embodiment of the present application, where the content search apparatus may be configured in a server or a terminal for executing the content search method.
As shown in fig. 6, the content search apparatus includes: a word segmentation module 110, an embedding module 120, a part of speech determination module 130, a domain determination module 140, a weight determination module 150, and a search module 160.
The word segmentation module 110 is configured to obtain a search text, and perform word segmentation on the search text to extract a plurality of keywords.
The embedding module 120 is configured to perform embedding processing on the plurality of keywords to obtain a keyword vector of each keyword.
And a part-of-speech determining module 130, configured to determine, based on the trained part-of-speech tagging model, a part-of-speech of each keyword according to the keyword vector.
And the domain determining module 140 is configured to determine whether each keyword is a domain word according to the keyword vector based on the trained domain prediction model.
The weight determining module 150 is configured to determine a weight value of each keyword according to a part of speech of each keyword and whether each keyword is a domain word, where the weight value of a keyword that is a domain word is greater than the weight value of a keyword that is not a domain word.
And the search module 160 is configured to output a search result according to the keyword and the weight value of the keyword based on a search engine.
In some embodiments, the domain predictive model includes a Bilstm layer, a linear layer. The domain determining module 140 includes:
a bidirectional long and short memory network (blstm layer) sub-module 141, configured to determine a first hidden state vector from a forward direction of a number of the keyword vectors and a second hidden state vector from a backward direction of a number of the keyword vectors based on the blstm layer.
A linear layer (LSTM's output layer) sub-module 142, configured to perform fusion and dimensionality reduction on the first hidden state vector and the second hidden state vector based on the linear layer, and determine a domain score value of each keyword regarded as a domain word;
the domain determining sub-module 143 is configured to determine whether each of the keywords is a domain word according to a domain score value of each of the keywords regarded as a domain word.
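Illustratively, sub-modules 141 to 143 could be realized as the following PyTorch module; the layer sizes and the 0.5 decision threshold are assumptions of this sketch:

```python
# A minimal sketch of the domain prediction model: a Bilstm layer whose
# forward/backward hidden states are fused and dimension-reduced by a linear
# layer into one domain score value per keyword.
import torch
import torch.nn as nn

class DomainPredictor(nn.Module):
    def __init__(self, embed_dim=128, hidden=64):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden,
                              bidirectional=True, batch_first=True)
        self.linear = nn.Linear(2 * hidden, 1)   # fuse both directions, reduce to 1

    def forward(self, keyword_vectors):
        # keyword_vectors: (batch, num_keywords, embed_dim)
        states, _ = self.bilstm(keyword_vectors)  # concatenated fwd/bwd hidden states
        return self.linear(states).squeeze(-1)    # domain score per keyword

model = DomainPredictor()
scores = model(torch.randn(1, 5, 128))            # illustrative input
is_domain_word = torch.sigmoid(scores) > 0.5      # threshold is an assumption
```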
Illustratively, the content search apparatus further includes a named entity determining module 170, configured to determine whether each keyword is a key named entity.
Illustratively, the weight determining module 150 is further configured to determine the weight value of each keyword according to the part of speech of each keyword, whether each keyword is a domain word, and whether each keyword is a key named entity; the weight value of a keyword that is a key named entity is greater than the weight value of a keyword that is neither a key named entity nor a domain word. A possible weighting rule is sketched below.
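Illustratively, one weighting rule consistent with the constraints above (key named entity > domain word > ordinary keyword) could look as follows; the concrete values and the part-of-speech bonus are assumptions of this sketch:

```python
# Hypothetical weight assignment honoring the ordering described above.
def keyword_weight(pos: str, is_domain_word: bool, is_key_entity: bool) -> float:
    if is_key_entity:
        weight = 3.0          # key named entities rank highest
    elif is_domain_word:
        weight = 2.0          # domain words outrank ordinary keywords
    else:
        weight = 1.0
    if pos in ("n", "v"):     # assumed bonus for content-bearing parts of speech
        weight += 0.5
    return weight
```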
Referring to fig. 7, fig. 7 is a schematic diagram of a domain prediction model training apparatus according to an embodiment of the present application, where the domain prediction model training apparatus may be configured in a server or a terminal for executing the above-mentioned training method of the domain prediction model.
As shown in fig. 7, the domain prediction model training apparatus includes: the system comprises a data acquisition module 210, a first vector processing module 220, a first connection module 230, a second vector processing module 240, a second connection module 250, a loss determination module 260, and a parameter adjustment module 270.
The data acquiring module 210 is configured to acquire training data, where the training data includes content text of a popular domain and content text of an uncommon domain.
The first vector processing module 220 is configured to extract, based on a first domain prediction model, a hidden state sequence from the content text of the popular domain, and reduce the dimensionality of the hidden state sequence to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word.
The first connection module 230 is configured to perform full-connection processing, based on a first full connection layer, on the first scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a first scoring value.
The second vector processing module 240 is configured to extract, based on a second domain prediction model, a hidden state sequence from the content text of the uncommon domain, and reduce the dimensionality of the hidden state sequence to obtain a second scoring vector in which each keyword in the content text is regarded as a domain word.
The second connection module 250 is configured to perform full-connection processing, based on a second full connection layer, on the second scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a second scoring value.
A loss determining module 260, configured to calculate a scoring loss value according to the first scoring value and the second scoring value, calculate a maximum mean difference loss according to the first scoring vector and the second scoring vector, and calculate a joint loss value according to the scoring loss value and the maximum mean difference loss.
The parameter adjusting module 270 is configured to adjust the parameters of the second domain prediction model according to the joint loss value, and determine the adjusted second domain prediction model as the domain prediction model.
Illustratively, the data acquisition module 210 includes:
the word segmentation processing sub-module 211 is used for segmenting the content text in the popular field to obtain a plurality of keywords of the content text in the popular field and segmenting the content text in the uncommon field to obtain a plurality of keywords of the content text in the uncommon field.
The embedding processing submodule 212 is used for embedding a plurality of keywords of the popular field content text to obtain a plurality of keyword vectors of the popular field content text and embedding a plurality of keywords of the uncommon field content text to obtain a plurality of keyword vectors of the uncommon field content text.
Illustratively, the first vector processing module 220 is specifically configured to extract, based on the first domain prediction model, a hidden state sequence from the plurality of keyword vectors of the content text of the popular domain.
Illustratively, the second vector processing module 240 is specifically configured to extract, based on the second domain prediction model, a hidden state sequence from the plurality of keyword vectors of the content text of the uncommon domain.
Illustratively, the first domain prediction model comprises: a Bilstm layer and a linear layer.
The first vector processing module 220 is specifically configured to determine, based on the Bilstm layer, a first hidden state vector from the forward direction of the content text of the popular domain and a second hidden state vector from the backward direction of the content text of the popular domain; and to fuse and reduce the dimensionality of the first hidden state vector and the second hidden state vector, based on the linear layer, to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word.
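Illustratively, the joint loss computed by the loss determination module 260 could look as follows; the Gaussian-kernel form of the maximum mean difference (MMD), the MSE scoring loss, and the balancing coefficient are assumptions of this sketch:

```python
# A sketch of the joint loss: a scoring loss between the first and second
# scoring values plus a maximum mean difference between the two scoring vectors.
import torch
import torch.nn.functional as F

def gaussian_mmd(x, y, sigma=1.0):
    # x: (n, d) first scoring vectors; y: (m, d) second scoring vectors
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def joint_loss(first_score, second_score, first_vec, second_vec, lam=0.1):
    scoring_loss = F.mse_loss(first_score, second_score)  # assumed scoring loss
    mmd = gaussian_mmd(first_vec, second_vec)
    return scoring_loss + lam * mmd   # lam balances the two terms (assumption)
```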
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus, the modules and the units described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The methods, apparatus, and devices of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The above-described methods and apparatuses may be implemented, for example, in the form of a computer program that can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal.
As shown in fig. 8, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of the content search methods and/or the aforementioned domain prediction model training methods.
The processor provides computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running a computer program in the non-volatile storage medium, which, when executed by the processor, causes the processor to perform any one of the content search methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that this configuration is merely a block diagram of a portion of the structure associated with the aspects of the present application and does not limit the computer device to which the aspects of the present application are applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in some embodiments, the processor is configured to execute a computer program stored in the memory to implement the following steps: acquiring a search text, and segmenting the search text to extract a plurality of keywords; embedding the keywords to obtain a keyword vector of each keyword; determining the part of speech of each keyword according to the keyword vector based on the trained part-of-speech tagging model; determining whether each keyword is a domain word according to the keyword vector based on a trained domain prediction model; determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a domain word, wherein the weight value of a keyword that is a domain word is greater than the weight value of a keyword that is not a domain word; and outputting a search result according to the keywords and the weight values of the keywords based on a search engine.
Illustratively, when determining, based on the trained domain prediction model, whether each keyword is a domain word according to the keyword vector, the processor is configured to implement: determining, based on the Bilstm layer, a first hidden state vector from the forward direction of the keyword vectors and a second hidden state vector from the backward direction of the keyword vectors; fusing and reducing the dimensionality of the first hidden state vector and the second hidden state vector, based on the linear layer, to determine a domain score value at which each keyword is regarded as a domain word; and determining whether each keyword is a domain word according to the domain score value at which each keyword is regarded as a domain word.
Illustratively, when determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a domain word, wherein the weight value of a keyword that is a domain word is greater than the weight value of a keyword that is not a domain word, the processor is configured to implement: determining whether each keyword is a key named entity; and determining the weight value of each keyword according to the part of speech of each keyword, whether each keyword is a domain word, and whether each keyword is a key named entity, wherein the weight value of a keyword that is a key named entity is greater than the weight value of a keyword that is neither a key named entity nor a domain word.
In other embodiments, the processor is configured to execute a computer program stored in the memory to implement the following steps: acquiring training data, wherein the training data comprises content text of a popular domain and content text of an uncommon domain; extracting, based on a first domain prediction model, a hidden state sequence from the content text of the popular domain, and reducing the dimensionality of the hidden state sequence to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word; performing full-connection processing, based on a first full connection layer, on the first scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a first scoring value; extracting, based on a second domain prediction model, a hidden state sequence from the content text of the uncommon domain, and reducing the dimensionality of the hidden state sequence to obtain a second scoring vector in which each keyword in the content text is regarded as a domain word; performing full-connection processing, based on a second full connection layer, on the second scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a second scoring value; calculating a scoring loss value according to the first scoring value and the second scoring value, calculating a maximum mean difference loss according to the first scoring vector and the second scoring vector, and calculating a joint loss value according to the scoring loss value and the maximum mean difference loss; and adjusting parameters of the second domain prediction model according to the joint loss value, and determining the adjusted second domain prediction model as the domain prediction model.
Illustratively, when acquiring the training data comprising the content text of the popular domain and the content text of the uncommon domain, the processor is configured to implement: performing word segmentation and embedding processing on the content text of the popular domain to obtain a plurality of keyword vectors of the content text of the popular domain, and performing word segmentation and embedding processing on the content text of the uncommon domain to obtain a plurality of keyword vectors of the content text of the uncommon domain; extracting, based on the first domain prediction model, a hidden state sequence from the plurality of keyword vectors of the content text of the popular domain; and extracting, based on the second domain prediction model, a hidden state sequence from the plurality of keyword vectors of the content text of the uncommon domain.
Illustratively, when extracting, based on the first domain prediction model, a hidden state sequence from the content text of the popular domain and reducing the dimensionality of the hidden state sequence to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word, the processor is configured to implement: determining, based on the Bilstm layer, a first hidden state vector from the forward direction of the content text of the popular domain and a second hidden state vector from the backward direction of the content text of the popular domain; and fusing and reducing the dimensionality of the first hidden state vector and the second hidden state vector, based on the linear layer, to obtain the first scoring vector in which each keyword in the content text is regarded as a domain word.
For example, the above training method may be used to obtain the trained domain prediction model. The trained domain prediction model is deployed on a terminal or a server, so that when the content search method is executed, whether each keyword is a domain word can be determined according to the keyword vectors based on the domain prediction model, improving the accuracy of content search.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, magnetic disk, or optical disk, and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or certain parts of the embodiments of the present application, such as:
a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, the computer program includes program instructions, and the processor executes the program instructions to implement any content search method provided in an embodiment of the present application; or
any domain prediction model training method provided in the embodiments of the present application is implemented.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) equipped on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for searching for content, the method comprising:
acquiring a search text, and segmenting the search text to extract a plurality of keywords;
embedding the keywords to obtain a keyword vector of each keyword;
determining the part of speech of each keyword according to the keyword vector based on the trained part of speech tagging model;
determining whether each keyword is a domain word or not according to the keyword vector based on a trained domain prediction model;
determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a domain word, wherein the weight value of a keyword that is a domain word is greater than the weight value of a keyword that is not a domain word;
and outputting a search result according to the keywords and the weight values of the keywords based on a search engine.
2. The content search method according to claim 1, wherein the domain prediction model comprises a Bilstm layer and a linear layer;
wherein the determining, based on the trained domain prediction model, whether each keyword is a domain word according to the keyword vector comprises the following steps:
based on the Bilstm layer, determining a first hidden state vector from the forward direction of the keyword vectors and a second hidden state vector from the backward direction of the keyword vectors;
based on the linear layer, fusing and reducing the dimensionality of the first hidden state vector and the second hidden state vector, and determining a domain score value at which each keyword is regarded as a domain word;
and determining whether each keyword is a domain word according to the domain score value at which each keyword is regarded as a domain word.
3. The content search method according to claim 1 or 2, characterized in that the method further comprises:
determining whether each keyword is a key named entity;
wherein the determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a domain word comprises:
determining the weight value of each keyword according to the part of speech of each keyword, whether each keyword is a domain word, and whether each keyword is a key named entity;
the weight value of the keyword which is the key named entity is larger than the weight value of the keyword which is not the key named entity and is not the domain word.
4. A method for training a domain prediction model, the method comprising:
acquiring training data, wherein the training data comprises content text of a popular domain and content text of an uncommon domain;
extracting, based on a first domain prediction model, a hidden state sequence from the content text of the popular domain, and reducing the dimensionality of the hidden state sequence to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word;
performing full-connection processing, based on a first full connection layer, on the first scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a first scoring value;
extracting, based on a second domain prediction model, a hidden state sequence from the content text of the uncommon domain, and reducing the dimensionality of the hidden state sequence to obtain a second scoring vector in which each keyword in the content text is regarded as a domain word;
performing full-connection processing, based on a second full connection layer, on the second scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a second scoring value;
calculating a scoring loss value according to the first scoring value and the second scoring value, calculating a maximum mean difference loss according to the first scoring vector and the second scoring vector, and calculating a joint loss value according to the scoring loss value and the maximum mean difference loss;
and adjusting parameters of the second domain prediction model according to the joint loss value, and determining the adjusted second domain prediction model as the domain prediction model.
5. The domain prediction model training method of claim 4, further comprising: performing word segmentation and embedding processing on the content text of the popular domain to obtain a plurality of keyword vectors of the content text of the popular domain, and performing word segmentation and embedding processing on the content text of the uncommon domain to obtain a plurality of keyword vectors of the content text of the uncommon domain;
wherein the extracting, based on the first domain prediction model, a hidden state sequence from the content text of the popular domain comprises: extracting, based on the first domain prediction model, a hidden state sequence from the plurality of keyword vectors of the content text of the popular domain;
and the extracting, based on the second domain prediction model, a hidden state sequence from the content text of the uncommon domain comprises: extracting, based on the second domain prediction model, a hidden state sequence from the plurality of keyword vectors of the content text of the uncommon domain.
6. The method of claim 4 or 5, wherein the first and second domain prediction models each comprise: a Bilstm layer and a linear layer;
the extracting a hidden state sequence from the content text of the popular domain based on the first domain prediction model and obtaining a first component of each keyword in the content text regarded as a domain word by dimensionality reduction of the hidden state sequence comprise:
determining a first hidden state vector from a forward direction of the content text of the trending domain and a second hidden state vector from a backward direction of the content text of the trending domain based on a BILSTM layer of the first domain prediction model;
based on a linear layer of the first domain prediction model, fusing and dimensionality reduction are carried out on the first hidden state vector and the second hidden state vector to obtain a first scoring vector of each keyword in the content text, wherein the keyword is regarded as a domain word;
the extracting a hidden state sequence from the content text of the rarely-used field and reducing the dimension of the hidden state sequence to obtain a second scoring vector of each keyword in the content text regarded as a field word based on a second field prediction model comprises the following steps:
determining a third hidden state vector from the forward direction of the content text of the uncommon field and a fourth hidden state vector from the backward direction of the content text of the uncommon field based on a BILSTM layer of the second field prediction model;
and based on a linear layer of the second domain prediction model, fusing and dimensionality reduction are carried out on the third hidden state vector and the fourth hidden state vector to obtain a second scoring vector, wherein each keyword in the content text is regarded as a domain word.
7. A content search apparatus, comprising:
the word segmentation module is used for acquiring a search text and segmenting the search text to extract a plurality of keywords;
the embedding module is used for embedding the plurality of keywords to obtain a keyword vector of each keyword;
the part-of-speech determining module is used for determining the part-of-speech of each keyword according to the keyword vector based on the trained part-of-speech tagging model;
the domain determining module is used for determining whether each keyword is a domain word or not according to the keyword vector based on a trained domain prediction model;
the weight determining module is used for determining the weight value of each keyword according to the part of speech of each keyword and whether each keyword is a domain word, wherein the weight value of a keyword that is a domain word is greater than the weight value of a keyword that is not a domain word;
and the searching module is used for outputting a searching result according to the keyword and the weight value of the keyword based on a searching engine.
8. A training device for a domain prediction model is characterized by comprising:
the data acquisition module is used for acquiring training data, wherein the training data comprises content text of a popular domain and content text of an uncommon domain;
the first vector processing module is used for extracting, based on a first domain prediction model, a hidden state sequence from the content text of the popular domain and reducing the dimensionality of the hidden state sequence to obtain a first scoring vector in which each keyword in the content text is regarded as a domain word;
the first connection module is used for performing full-connection processing, based on a first full connection layer, on the first scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a first scoring value;
the second vector processing module is used for extracting, based on a second domain prediction model, a hidden state sequence from the content text of the uncommon domain and reducing the dimensionality of the hidden state sequence to obtain a second scoring vector in which each keyword in the content text is regarded as a domain word;
the second connection module is used for performing full-connection processing, based on a second full connection layer, on the second scoring vector in which each keyword in the content text is regarded as a domain word, to obtain a second scoring value;
a loss determination module, configured to calculate a scoring loss value according to the first scoring value and the second scoring value, calculate a maximum mean difference loss according to the first scoring vector and the second scoring vector, and calculate a joint loss value according to the scoring loss value and the maximum mean difference loss;
and the parameter adjusting module is used for adjusting the parameters of the second domain prediction model according to the joint loss value and determining the adjusted second domain prediction model as the domain prediction model.
9. A computer device, wherein the computer device comprises a memory and a processor;
the memory for storing a computer program;
the processor is used for executing the computer program and, when executing the computer program, implementing:
the content search method according to any one of claims 1 to 3; and/or
The domain predictive model training method of any one of claims 4 to 6.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
the content search method according to any one of claims 1 to 3; and/or
The domain predictive model training method of any one of claims 4 to 6.
CN202010700846.1A 2020-07-20 2020-07-20 Content search method, field prediction model training method, device and storage medium Active CN111737560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010700846.1A CN111737560B (en) 2020-07-20 2020-07-20 Content search method, field prediction model training method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111737560A 2020-10-02
CN111737560B CN111737560B (en) 2021-01-08

Family

ID=72655180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010700846.1A Active CN111737560B (en) 2020-07-20 2020-07-20 Content search method, field prediction model training method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111737560B (en)



Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081601A (en) * 2009-11-27 2011-06-01 北京金山软件有限公司 Field word identification method and device
CN105512224A (en) * 2015-11-30 2016-04-20 清华大学 Search engine user satisfaction automatic assessment method based on cursor position sequence
CN105808529A (en) * 2016-03-10 2016-07-27 武汉传神信息技术有限公司 Method and device of corpora division field
US20180101532A1 (en) * 2016-10-06 2018-04-12 Oracle International Corporation Searching data sets
CN107133220A (en) * 2017-06-07 2017-09-05 东南大学 Name entity recognition method in a kind of Geography field
US10592542B2 (en) * 2017-08-31 2020-03-17 International Business Machines Corporation Document ranking by contextual vectors from natural language query
CN109446514A (en) * 2018-09-18 2019-03-08 平安科技(深圳)有限公司 Construction method, device and the computer equipment of news property identification model
CN110162795A (en) * 2019-05-30 2019-08-23 重庆大学 A kind of adaptive cross-cutting name entity recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhong Wenbo: "Evaluation of Keyword Classification Methods in Search Engines and Their Application to Recommendation", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112507109A (en) * 2020-12-11 2021-03-16 重庆知识产权大数据研究院有限公司 Retrieval method and device based on semantic analysis and keyword recognition
CN112559895A (en) * 2021-02-19 2021-03-26 深圳平安智汇企业信息管理有限公司 Data processing method and device, electronic equipment and storage medium
CN112559895B (en) * 2021-02-19 2021-05-18 深圳平安智汇企业信息管理有限公司 Data processing method and device, electronic equipment and storage medium
CN113434636A (en) * 2021-06-30 2021-09-24 平安科技(深圳)有限公司 Semantic-based approximate text search method and device, computer equipment and medium
CN113434636B (en) * 2021-06-30 2024-06-18 平安科技(深圳)有限公司 Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN113609248A (en) * 2021-08-20 2021-11-05 北京金山数字娱乐科技有限公司 Word weight generation model training method and device and word weight generation method and device

Also Published As

Publication number Publication date
CN111737560B (en) 2021-01-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant