CN116501831A - Problem recall method, device, equipment and storage medium


Info

Publication number
CN116501831A
CN116501831A (application CN202210057608.2A)
Authority
CN
China
Prior art keywords
recall
query
semantic
determining
data
Prior art date
Legal status: Pending
Application number
CN202210057608.2A
Other languages
Chinese (zh)
Inventor
纪兴光
Current Assignee
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority: CN202210057608.2A
Publication: CN116501831A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/332 Query formulation
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis


Abstract

The invention belongs to the technical field of search engines, and discloses a problem recall method, device, equipment, and storage medium. The method comprises the following steps: determining semantic vector features corresponding to a target query problem; obtaining a corresponding semantic vector set to be recalled according to the semantic vector features; and determining recall problems according to the semantic vector set to be recalled, so as to complete problem recall. In this way, accurate recall of the query problem is achieved. The invention performs recall by computing the semantic vector features of the query problem and matching problems with the same semantics, accurately captures the user's search intention, and can also accurately recall results that are semantically similar but share no matching words, improving both recall accuracy and recall rate.

Description

Problem recall method, device, equipment and storage medium
Technical Field
The present invention relates to the field of search engine technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recall of a problem.
Background
In a search task, it is very difficult to accurately infer the user's search intention from the user's Query and to accurately characterize the semantics of documents. Existing search algorithms mainly rely on keyword matching combined with an inverted index; their generalization ability is limited, and it is difficult for them to accurately recall results that are semantically similar but share no matching words, which affects the final effect.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The invention mainly aims to provide a problem recall method, device, equipment, and storage medium, so as to solve the technical problem in the prior art that a user's search intention cannot be accurately inferred in a search task.
In order to achieve the above object, the present invention provides a problem recall method, comprising the steps of:
determining semantic vector features corresponding to the target query problem;
obtaining a corresponding semantic vector set to be recalled according to the semantic vector features;
and determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
Optionally, the obtaining a corresponding semantic vector set to be recalled according to the semantic vector features includes:
inputting the semantic vector features into a preset target semantic recall model to obtain the semantic vector set to be recalled.
Optionally, before the inputting of the semantic vector features into the preset target semantic recall model to obtain the semantic vector set to be recalled, the method further includes:
acquiring first training data, wherein the first training data comprises first sampling query data, first positive sample data and first negative sample data;
Training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized;
determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data;
training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
Optionally, before the acquiring the first training data, the method further includes:
determining a common sample according to a preset question-answer data set;
acquiring a sampling network address, wherein the sampling network address is a network address with a query function;
determining a plurality of associated query information according to the sampling network address information;
obtaining a difficult sample according to each piece of associated query information;
and generating first training data according to the common sample and the difficult sample.
Optionally, the obtaining the difficult sample according to each associated query information includes:
determining a query information pair according to the associated query information;
Determining click information of the query information pair;
screening the query information pairs according to the click information to obtain effective query information pairs;
and generating a difficult sample according to the query information pair.
Optionally, the obtaining the difficult sample according to each associated query information includes:
determining a current query text and a query result corresponding to the current query text according to the associated query information;
determining the click rate of each inquiry result;
determining an effective query result according to the click rate;
and generating a difficult sample according to the target query text and the effective query result.
Optionally, the determining the valid query result according to the click rate includes:
acquiring the number of query results corresponding to the current query text;
and determining effective query results according to the number of the query results and the click rate of the query results.
Optionally, training the initial semantic recall model according to the first training data to obtain the semantic model to be optimized includes:
inputting the first training data into an initial semantic recall model to obtain a semantic vector representation;
calculating a loss value according to the semantic vector representation;
and adjusting the initial semantic recall model according to the loss value until the model converges to obtain the semantic model to be optimized.
Optionally, the determining the recall problem according to the semantic vector set to be recalled to complete the problem recall includes:
carrying out keyword analysis according to the target query problem to obtain keyword information;
obtaining an alternative recall problem according to the keyword information;
and determining a recall problem according to the semantic vector set and the alternative recall problem to complete the recall of the problem.
Optionally, the determining the recall problem according to the semantic vector set to be recalled to complete the problem recall includes:
according to the semantic vector features and the semantic vector feature set to be recalled, matching a preset vector feature library to obtain a first to-be-recalled problem corresponding to the semantic vector features and a second to-be-recalled problem corresponding to the semantic vector feature set to be recalled;
and determining the recall problem according to the first to-be-recalled problem and the second to-be-recalled problem so as to complete the problem recall.
In addition, in order to achieve the above object, the present invention also proposes a problem recall device, including:
the determining module is used for determining semantic vector features corresponding to the target query problem;
the processing module is used for obtaining a corresponding semantic vector set to be recalled according to the semantic vector features;
And the processing module is also used for determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
Optionally, the processing module is further configured to input the semantic vector features into a preset target semantic recall model to obtain the semantic vector set to be recalled.
Optionally, the processing module is further configured to obtain first training data, where the first training data includes first sampling query data, first positive sample data, and first negative sample data;
training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized;
determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data;
training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
Optionally, the processing module is further configured to determine a common sample according to a preset question-answer data set;
Acquiring a sampling network address, wherein the sampling network address is a network address with a query function;
determining a plurality of associated query information according to the sampling network address information;
obtaining a difficult sample according to each piece of associated query information;
and generating first training data according to the common sample and the difficult sample.
Optionally, the processing module is further configured to determine a query information pair according to the associated query information;
determining click information of the query information pair;
screening the query information pairs according to the click information to obtain effective query information pairs;
and generating a difficult sample according to the query information pair.
Optionally, the processing module is further configured to determine a current query text and a query result corresponding to the current query text according to the associated query information;
determining the click rate of each inquiry result;
determining an effective query result according to the click rate;
and generating a difficult sample according to the target query text and the effective query result.
Optionally, the processing module is further configured to obtain the number of query results corresponding to the current query text;
and determining effective query results according to the number of the query results and the click rate of the query results.
Optionally, the processing module is further configured to input the first training data into an initial semantic recall model to obtain a semantic vector representation;
calculating a loss value according to the semantic vector representation;
and adjusting the initial semantic recall model according to the loss value until the model converges to obtain the semantic model to be optimized.
In addition, in order to achieve the above object, the present invention also proposes a problem recall apparatus including: a memory, a processor, and a problem recall program stored on the memory and executable on the processor, the problem recall program configured to implement the steps of the problem recall method as described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having a problem recall program stored thereon, which when executed by a processor, implements the steps of the problem recall method as described above.
The method comprises the steps of determining semantic vector features corresponding to a target query problem; obtaining a corresponding semantic vector set to be recalled according to the semantic vector features; and determining recall problems according to the semantic vector set to be recalled to complete problem recall. In this way, accurate recall of the query problem is achieved. The invention performs recall by computing the semantic vector features of the query problem and matching problems with the same semantics, accurately captures the user's search intention, and can also accurately recall results that are semantically similar but share no matching words, improving both recall accuracy and recall rate.
Drawings
FIG. 1 is a schematic structural diagram of a problem recall device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart of a first embodiment of the problem recall method of the present invention;
FIG. 3 is a flowchart of a second embodiment of the problem recall method of the present invention;
FIG. 4 is a block diagram of a first embodiment of the problem recall device of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a problem recall device in a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the problem recall device may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed Random Access Memory (RAM) or a stable Non-Volatile Memory (NVM), such as disk storage. The memory 1005 may optionally also be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in FIG. 1 is not limiting of the problem recall device and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a problem recall program may be included in the memory 1005 as one type of storage medium.
In the problem recall device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The problem recall device calls the problem recall program stored in the memory 1005 through the processor 1001 and executes the problem recall method provided by the embodiments of the present invention.
An embodiment of the present invention provides a problem recall method, referring to fig. 2, and fig. 2 is a schematic flow chart of a first embodiment of a problem recall method according to the present invention.
In this embodiment, the problem recall method includes the following steps:
step S10: and determining semantic vector features corresponding to the target query problem.
It should be noted that, the execution body of the embodiment is a semantic recall system, and the semantic recall system may be set in a server, or may be set in another terminal device with the same or similar function as the server, which is not limited in this embodiment.
It can be appreciated that this embodiment applies to intelligent question answering, answer recommendation, and search, and the application scenario is not limited here. In a search task, it is very difficult to accurately infer the user's search intention from the user Query and to accurately characterize document semantics. Existing search algorithms mainly rely on keyword matching combined with an inverted index; their generalization ability is limited, and especially for long-tail queries it is difficult to accurately recall results that are semantically similar but share no matching words, which affects the final effect. Therefore, to address the limitation that the current system recalls purely by term matching, this embodiment improves recall through semantic recall: a corresponding semantic vector set to be recalled is obtained according to the semantic vector features, and recall problems are determined according to that set, so that problems whose textual similarity is low but whose semantic similarity is high are no longer missed, improving both recall rate and recall accuracy.
It should be noted that the target query problem is the text entered by the user when searching, or text information converted from input voice; it is generally a question used to query for a corresponding answer.
It should be understood that the semantic vector features may be obtained according to a semantic feature vector model, i.e. the text is input into the semantic feature vector model to obtain the corresponding feature vector.
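As an illustrative sketch of this step (the patent does not name a specific encoder; the sentence-transformers library and the model name below are assumptions), a pre-trained sentence encoder can map the target query problem to a dense semantic vector:

```python
# Sketch only: the patent does not specify an encoder; sentence-transformers
# and the model name below are illustrative assumptions.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_vector_features(target_query: str):
    # Encode the query text into a single dense semantic vector.
    return encoder.encode([target_query], normalize_embeddings=True)[0]

query_vec = semantic_vector_features("what is the temperature of the sun")
print(query_vec.shape)  # e.g. (384,) for this encoder
```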
Step S20: and obtaining a corresponding semantic vector set to be recalled according to the semantic vector features.
It can be understood that obtaining the corresponding semantic vector set to be recalled according to the semantic vector features may be done by searching a semantic vector library with the semantic vector features and collecting the matching vectors into the set to be recalled.
The semantic vector library may be a vector data set accumulated by the client during training, or a third-party semantic vector data set.
Specifically, the semantic vector features can be decomposed into different vector labels; the vector labels are then used as indexes to find associated semantic vector features in the library, and the similarity between the semantic vector features and each associated semantic vector feature is calculated to form the semantic vector set to be recalled. The similarity can be calculated as the cosine of the two semantic vector features, and associated semantic vector features whose cosine similarity exceeds a set threshold are added to the semantic vector set to be recalled.
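A minimal sketch of the cosine filtering just described, assuming the query and candidate vectors are already computed (the 0.8 threshold is illustrative, not a value from the patent):

```python
import numpy as np

def recall_set(query_vec, candidate_vecs, threshold=0.8):
    """Cosine-similarity filtering as described above; threshold is assumed."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    sims = c @ q                  # cosine of each candidate with the query
    keep = sims >= threshold      # keep candidates above the set threshold
    return candidate_vecs[keep], sims[keep]
```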
Step S30: and determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
It should be noted that, according to the semantic vector set, the problem information corresponding to each semantic vector in the set can be found, for example a problem number; through this problem information the corresponding recall problem can be located and recalled.
Specifically, the problem information is formed by vectorizing the problem library in advance and associating each problem's information with its semantic vector.
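A sketch of that pre-built association, assuming a vector index library such as FAISS (the patent names no index library) and hypothetical problem numbers:

```python
# Sketch: associating pre-computed problem vectors with problem numbers.
# FAISS is an illustrative choice; the problem IDs below are hypothetical.
import faiss
import numpy as np

dim = 384
problem_ids = ["Q101", "Q102", "Q103"]              # hypothetical problem numbers
problem_vecs = np.random.rand(3, dim).astype("float32")
faiss.normalize_L2(problem_vecs)

index = faiss.IndexFlatIP(dim)   # inner product equals cosine on unit vectors
index.add(problem_vecs)

def lookup_recall_problems(query_vec, k=2):
    q = query_vec.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    sims, idx = index.search(q, k)
    return [(problem_ids[i], float(s)) for i, s in zip(idx[0], sims[0])]
```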
In this embodiment, keyword analysis is performed according to the target query problem to obtain keyword information; obtaining an alternative recall problem according to the keyword information; and determining a recall problem according to the semantic vector set and the alternative recall problem to complete the recall of the problem.
It should be noted that a keyword recall path can be added alongside the semantic recall path. Semantic recall captures association mainly at the semantic level, while keyword recall retrieves similar problems with high textual similarity; combining the two gives the semantic recall system a higher recall rate.
Specifically, keyword analysis is performed on the target query problem to extract keyword information, and related problems are retrieved for recall according to the weights of the different keywords. For example, for the target query problem "what is the temperature of the sun", keywords such as "sun", "temperature", and "what" can be extracted, and "sun" and "temperature" clearly carry higher weight than "what". Problems containing "sun" and "temperature" (or their synonyms) therefore score a higher similarity with the target query problem than problems matching only "what", and the problems to recall are finally selected according to this similarity.
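A minimal sketch of this weighted keyword scoring; the keyword weights below are illustrative assumptions rather than values from the patent:

```python
# Minimal sketch of the weighted keyword recall described above.
# Keyword weights are illustrative assumptions, not values from the patent.
def keyword_score(query_keywords: dict, candidate_text: str) -> float:
    # Sum the weights of query keywords that appear in the candidate problem.
    return sum(w for kw, w in query_keywords.items() if kw in candidate_text)

keywords = {"sun": 0.5, "temperature": 0.4, "what": 0.1}
candidates = [
    "what is the surface temperature of the sun",
    "what time is it",
]
ranked = sorted(candidates, key=lambda c: keyword_score(keywords, c), reverse=True)
print(ranked[0])  # the sun/temperature problem outranks the "what"-only match
```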
In this embodiment, matching a preset vector feature library according to the semantic vector features and the semantic vector feature set to be recalled to obtain a first problem to be recalled corresponding to the semantic vector features and a second problem to be recalled corresponding to the semantic vector feature set to be recalled; and determining the recall problem according to the first to-be-recalled problem and the second to-be-recalled problem so as to complete the problem recall.
It should be noted that, although the elements in the semantic vector feature set to be recalled have high similarity with the semantic vector features, the semantic vector features of the target query statement entered by the user are still the most likely to be accurate. If problem recall were performed only from the semantic vector feature set to be recalled, the most accurate answers might be missed. Therefore, the semantic vector features corresponding to the target query statement are fused with the semantic vector feature set to be recalled: a first to-be-recalled problem is obtained from the semantic vector features and a second to-be-recalled problem from the set, and the two are then merged to obtain the final recall problem.
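A sketch of this fusion step, assuming both recall paths return lists of problem identifiers:

```python
# Sketch: fuse the problems recalled from the query's own vector with those
# recalled from the set to be recalled, de-duplicating while keeping order.
def merge_recall(first_recall: list, second_recall: list) -> list:
    seen, merged = set(), []
    for problem in first_recall + second_recall:
        if problem not in seen:       # first list takes precedence on ties
            seen.add(problem)
            merged.append(problem)
    return merged
```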
This embodiment determines semantic vector features corresponding to the target query problem; obtains a corresponding semantic vector set to be recalled according to the semantic vector features; and determines recall problems according to the semantic vector set to be recalled to complete problem recall. In this way, accurate recall of the query problem is achieved. The invention performs recall by computing the semantic vector features of the query problem and matching problems with the same semantics, accurately captures the user's search intention, and can also accurately recall results that are semantically similar but share no matching words, improving both recall accuracy and recall rate.
Referring to fig. 3, fig. 3 is a flowchart illustrating a problem recall method according to a second embodiment of the present invention.
Based on the first embodiment, the problem recall method in this embodiment further includes, in the step S20:
step S21: in this embodiment, a preset target semantic recall model is input according to the semantic vector features to obtain a semantic vector set to be recalled.
It should be noted that, in the recall problem process, a preset target semantic recall model may be input according to the semantic vector features to obtain a set of semantic vectors to be recalled, and the preset target semantic recall model may input semantic vector feature data and match a plurality of semantic vectors to be recalled with higher similarity.
In this embodiment, a preferable training process of the preset target semantic recall model is provided, and the training steps are as follows: step S211: acquiring first training data, wherein the first training data comprises first sampling query data, first positive sample data and first negative sample data; step S212: training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized; step S213: determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data; step S214: training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
It should be noted that the preset target semantic recall model recalls results by computing the similarity between vectors, realizing semantic retrieval based on semantic vector features. In sentence-vector representation tasks, generic models tend to output low-quality vectors that poorly reflect the similarity between two sentences, mainly because native vector representations tend to encode all sentences into a small region of the space, which makes the semantic similarity score of most sentence pairs high. Therefore, the preset target semantic recall model is specially trained, and in particular the training method is optimized, so that high-quality indexing over hundreds of billions of data records can be achieved.
It should be noted that in this embodiment the two training passes using the first training data and the second training data are based on contrastive learning, a common self-supervised learning method whose core idea is to pull positive samples closer and push negative samples farther apart. A common practice is to contrast one positive sample against K negative samples, and research shows that the larger K is, the better the effect. However, the first pass uses the format "Query (query data) + one positive sample + four negative samples", so each query sees only its few sampled negatives, and training only in this way limits how much further the model can learn sentence representations. Therefore, following the idea of contrastive learning, during the second training pass the samples of all other examples in the same Batch are treated as negative examples for the current query. Each sample becomes the form "query + one positive sample + one negative sample", but at training time the number of candidates n effectively becomes 2 × Batch Size. In addition, since the Batch Size is limited by GPU memory and the number of GPUs, in order to expose the model to as many negatives as possible, part of the data to be predicted is cached in memory through a Memory Bank, raising the effective Batch Size from 512 to 4096 and further improving model performance.
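A sketch of the in-batch-negative contrastive step described above, written in PyTorch as an assumption (the patent names no framework, and the temperature value is illustrative):

```python
# Sketch of the in-batch-negative contrastive step described above; PyTorch
# and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, pos_emb, neg_emb, temperature=0.05):
    # query_emb, pos_emb, neg_emb: (B, d) L2-normalized embeddings for
    # "query + one positive sample + one negative sample" records.
    candidates = torch.cat([pos_emb, neg_emb], dim=0)   # (2B, d): every other
    logits = query_emb @ candidates.T / temperature     # sample in the batch
    labels = torch.arange(query_emb.size(0),            # acts as a negative,
                          device=query_emb.device)      # so n = 2 * Batch Size
    return F.cross_entropy(logits, labels)
```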
Wherein, step S211: first training data is obtained, wherein the first training data comprises first sampling query data, first positive sample data and first negative sample data.
It should be noted that the first training data is the data used to train the model. It comprises first sampling query data (query), first positive sample data, and first negative sample data, where the first positive sample data are problems whose actual similarity to the first sampling query data is high, and the first negative sample data are problems whose similarity to the first sampling query data is low or that are unrelated.
In the embodiment, a common sample is determined according to a preset question-answer data set; acquiring a sampling network address, wherein the sampling network address is a network address with a query function; determining a plurality of associated query information according to the sampling network address information; obtaining a difficult sample according to each piece of associated query information; and generating first training data according to the common sample and the difficult sample.
The sampling network address is a network address with a query function; for example, the URL of a page for queries about premium ("lucky") mobile phone numbers can be taken as a sampling network address, and associated query information is mined from that URL. The associated query information consists of related queries, or titles associated with those queries, together with related data such as their click-through rates and click counts. From these data, the degree of association between different queries and titles can be obtained, and data that are associated in the logs but actually have low semantic similarity are selected as difficult samples.
It should be noted that ordinary training uses random samples as negative examples. This keeps training close to the online application scenario, but the semantic relevance between a randomly sampled negative and the positive Title is very low, so the discrimination task is too easy: the model converges after a few training rounds, and its ability to distinguish similar samples cannot improve further. Therefore, part of the samples are mined by the method of this embodiment: a batch of Titles is mined as difficult samples (Hard Negative Samples), which overlap with the Query in wording but have poor semantic relevance. Mixing this data into the training data raises the difficulty of the discrimination task and improves the model's discrimination ability. The difficult samples may be mined manually, or mined automatically and then manually audited; this embodiment does not limit the approach.
In this embodiment, a query information pair is determined according to the associated query information; determining click information of the query information pair; screening the query information pairs according to the click information to obtain effective query information pairs; and generating a difficult sample according to the query information pair.
In a specific implementation, this embodiment proposes a preferred scheme for difficult sample mining, as follows. A query information pair is determined from the associated query information, where a query information pair consists of queries that can be recommended for each other. For example, if Query1 is "what makes a mobile phone number good" and Query2 is "which numbers are more auspicious", and the information corresponding to Query2 is recommended when Query1 is searched and vice versa, then Query1 and Query2 form an information pair. Query pairs that are commonly displayed under the same URL but are never co-clicked can then be mined as difficult samples. For example, when results for "what makes a mobile phone number good" are recommended, nobody clicks the recommended question "which numbers are more auspicious", and the same holds in the opposite direction. The two are then considered not co-clicked: the website's recommendation system treats them as associated, but users evidently do not, which makes the pair ideal as a difficult sample. Query1 is then used as the sampling query data in the first training data and Query2 as the negative sample; the overall sample-collection approach is not limited here.
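A sketch of this mining scheme under an assumed log format of (URL, query, was_clicked) records; the co-click test below is a simplified proxy for the behavior described above:

```python
# Sketch: mine query pairs shown under the same URL but not co-clicked, as
# difficult samples. The log format is a hypothetical assumption.
from collections import defaultdict
from itertools import combinations

def mine_hard_pairs(display_log):
    """display_log: iterable of (url, query, was_clicked) records."""
    shown, clicked = defaultdict(set), defaultdict(set)
    for url, query, was_clicked in display_log:
        shown[url].add(query)
        if was_clicked:
            clicked[url].add(query)
    hard_pairs = []
    for url, queries in shown.items():
        for q1, q2 in combinations(sorted(queries), 2):
            # Co-displayed under the same URL, but not both clicked there:
            # associated in the logs, yet not in the eyes of users.
            if not (q1 in clicked[url] and q2 in clicked[url]):
                hard_pairs.append((q1, q2))   # q1 as query, q2 as negative
    return hard_pairs
```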
In this embodiment, a current query text and a query result corresponding to the current query text are determined according to the associated query information; determining the click rate of each inquiry result; determining an effective query result according to the click rate; and generating a difficult sample according to the target query text and the effective query result.
It should be noted that, in the process of determining the current query text and its corresponding query results from the associated query information, a query result is a related problem recommended for the current query text. For example, when the current query query3 is "which mobile phone numbers are more auspicious", several related problems may be recommended, such as title1 "which mobile numbers are auspicious, and which are the most auspicious", title2 "which mobile numbers are auspicious for people born in the year of the horse", and title3 "which numbers suit people born in the year of the horse". The current query text is then query3, and the corresponding query results are title1, title2, and title3. The click-through rate of each recommended result is determined: a high click-through rate means users consider the two semantically very similar, while a low click-through rate may just reflect an occasional curiosity click. For example, if the click-through rates of title1, title2, and title3 are 0.4, 0.15, and 0 respectively, then title3 with a click-through rate of 0 can be selected as the negative sample, title1 with the highest click-through rate as the positive sample, and query3 as the sampling query data to generate the first training data.
In this embodiment, the number of query results corresponding to the current query text is obtained; and determining effective query results according to the number of the query results and the click rate of the query results.
It can be understood that the number of query results is the number of problems recommended by the current page's recommendation system. To obtain more accurate negatives, the effective query results should be determined from both the position and the click-through rate of each result: results ranked lower on the page are harder to see, so top-ranked results are more likely to be clicked simply because of their position rather than genuine similarity. Position and click-through rate therefore have to be considered together, and several query results that rank near the top of the page yet fall below a click-through-rate threshold are selected as negative samples to construct the first training data.
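A sketch of this position-plus-click-rate selection; the top-k cutoff and the click-through-rate threshold are illustrative assumptions:

```python
# Sketch: pick negatives from results that rank near the top of the page yet
# have a click-through rate below a threshold; the cutoffs are assumptions.
def select_training_samples(results, top_k=5, ctr_threshold=0.05):
    """results: list of (title, position, ctr), position 0 = top of page."""
    visible = [r for r in results if r[1] < top_k]      # only well-seen slots
    positive = max(visible, key=lambda r: r[2])         # highest-CTR result
    negatives = [r for r in visible
                 if r[2] < ctr_threshold and r is not positive]
    return positive, negatives
```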
Step S212: and training the initial semantic recall model according to the first training data to obtain a semantic model to be optimized.
It should be noted that the initial semantic recall model is an untrained initial model, for example a BERT model; this embodiment does not limit the type of the initial model. The training process shortens the distance to positive samples and lengthens the distance to negative samples until the model converges.
In this embodiment, the first training data is input into the initial semantic recall model to obtain a semantic vector representation; a loss value is calculated according to the semantic vector representation; and the initial semantic recall model is adjusted according to the loss value until the model converges, obtaining the semantic model to be optimized.
It should be noted that the loss value may be calculated by a loss function. This embodiment proposes a preferred scheme in which the loss value is calculated by cross entropy:

H(p, q) = -Σᵢ₌₁ⁿ p(xᵢ) log q(xᵢ)

where n is the number of samples (positive samples + negative samples); for example, when each training record is in the format "query + 1 positive sample + 4 negative samples", n = 5. p(xᵢ) is the label of sample xᵢ: 1 for a positive sample and 0 for a negative sample. q(xᵢ) is the probability that sample xᵢ is predicted to be a positive example, and H(p, q) is the loss value. The initial semantic recall model is adjusted according to the loss value until the model converges, obtaining the semantic model to be optimized.
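A worked numeric example of this loss for one record in the "query + 1 positive sample + 4 negative samples" format (the predicted probabilities are illustrative):

```python
# Worked example of the cross-entropy loss above for one training record in
# the "query + 1 positive sample + 4 negative samples" format (n = 5).
import math

labels = [1, 0, 0, 0, 0]                # p(x_i): 1 positive, 4 negatives
probs  = [0.7, 0.1, 0.1, 0.05, 0.05]    # q(x_i): predicted positive-example
                                        # probabilities (illustrative values)
loss = -sum(p * math.log(q) for p, q in zip(labels, probs))
print(round(loss, 4))                   # 0.3567 = -log(0.7)
```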
Step S213: and determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data.
It should be noted that when the ratio of the second positive sample data to the second negative sample data is equal to the ratio of the first positive sample data to the first negative sample data, the first training data and the second training data have the same form, and the two training passes are identical. The improvement here is that, in the usual "query + one positive sample + four negative samples" scheme, each query sees only its few sampled negatives, which limits further learning of sentence representations by the model. Therefore, following the idea of contrastive learning, the Titles of all other samples in the same Batch are treated as negative examples of the current query, and each sample becomes the form "query + one positive sample + one negative sample".
Further, when the ratio of the second positive sample data to the second negative sample data is greater than the ratio of the first positive sample data to the first negative sample data, the model may first be trained on the first training data, for example with a positive : negative ratio of 1:4, to obtain a transition model, and then trained on the second training data with a positive : negative ratio of 1:1 to improve the model's discrimination capability.
Step S214: training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
It can be understood that training the semantic model to be optimized according to the second training data can obtain a target semantic recall model, and the training process is consistent with the training process corresponding to the first training data.
This embodiment obtains first training data, wherein the first training data comprises first sampling query data, first positive sample data, and first negative sample data; trains an initial semantic recall model according to the first training data to obtain a semantic model to be optimized; determines second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data, and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data; and trains the semantic model to be optimized according to the second training data to obtain the target semantic recall model. In this way, training of the preset target semantic recall model is achieved; the difficult samples improve the model's discrimination capability and thereby its accuracy.
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium is stored with a problem recall program, and the problem recall program realizes the steps of the problem recall method when being executed by a processor.
Referring to fig. 4, fig. 4 is a block diagram showing the structure of a first embodiment of the problem recall device of the present invention.
As shown in fig. 4, the problem recall device according to the embodiment of the present invention includes:
a determining module 10, configured to determine semantic vector features corresponding to the target query problem;
the processing module 20 is configured to obtain a corresponding semantic vector set to be recalled according to the semantic vector features;
the processing module 20 is further configured to determine a recall problem according to the semantic vector set to be recalled, so as to complete problem recall.
It should be understood that the foregoing is illustrative only and is not limiting, and that in specific applications, those skilled in the art may set the invention as desired, and the invention is not limited thereto.
In this embodiment, the determining module 10 determines semantic vector features corresponding to the target query problem; the processing module 20 obtains a corresponding semantic vector set to be recalled according to the semantic vector features; and the processing module 20 determines recall problems according to the semantic vector set to be recalled to complete problem recall. In this way, accurate recall of the query problem is achieved. The invention performs recall by computing the semantic vector features of the query problem and matching problems with the same semantics, accurately captures the user's search intention, and can also accurately recall results that are semantically similar but share no matching words, improving both recall accuracy and recall rate.
In an embodiment, the processing module 20 is further configured to input the semantic vector features into a preset target semantic recall model to obtain the set of semantic vectors to be recalled.
In an embodiment, the processing module 20 is further configured to obtain first training data, where the first training data includes first sampling query data, first positive sample data, and first negative sample data;
training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized;
determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data;
training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
In one embodiment, the processing module 20 is further configured to determine a common sample according to a preset question-answer data set;
acquiring a sampling network address, wherein the sampling network address is a network address with a query function;
Determining a plurality of associated query information according to the sampling network address information;
obtaining a difficult sample according to each piece of associated query information;
and generating first training data according to the common sample and the difficult sample.
In an embodiment, the processing module 20 is further configured to determine a query information pair according to the associated query information;
determining click information of the query information pair;
screening the query information pairs according to the click information to obtain effective query information pairs;
and generating a difficult sample according to the query information pair.
In an embodiment, the processing module 20 is further configured to determine a current query text and a query result corresponding to the current query text according to the associated query information;
determining the click rate of each inquiry result;
determining an effective query result according to the click rate;
and generating a difficult sample according to the target query text and the effective query result.
In an embodiment, the processing module 20 is further configured to obtain the number of query results corresponding to the current query text;
and determining effective query results according to the number of the query results and the click rate of the query results.
In an embodiment, the processing module 20 is further configured to input the first training data into an initial semantic recall model to obtain a semantic vector representation;
Calculating a loss value according to the semantic vector representation;
and adjusting the initial semantic recall model according to the loss value until the model converges to obtain the semantic model to be optimized.
In an embodiment, the processing module 20 is further configured to perform keyword analysis according to the target query question to obtain keyword information;
obtaining an alternative recall problem according to the keyword information;
and determining a recall problem according to the semantic vector set and the alternative recall problem to complete the recall of the problem.
In an embodiment, the processing module 20 is further configured to match a preset vector feature library according to the semantic vector feature and the semantic vector feature set to be recalled, so as to obtain a first problem to be recalled corresponding to the semantic vector feature and a second problem to be recalled corresponding to the semantic vector feature set to be recalled;
and determining the recall problem according to the first to-be-recalled problem and the second to-be-recalled problem so as to complete the problem recall.
It should be noted that the above-described working procedure is merely illustrative, and does not limit the scope of the present invention, and in practical application, a person skilled in the art may select part or all of them according to actual needs to achieve the purpose of the embodiment, which is not limited herein.
In addition, technical details not described in detail in this embodiment may refer to the problem recall method provided in any embodiment of the present invention, and are not described herein.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment methods may be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, although in many cases the former is preferred. Based on such understanding, the technical solution of the present invention, or the part of it contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a Read-Only Memory (ROM)/RAM, a magnetic disk, or an optical disk) and including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.
The application also discloses A1, a problem recall method, the problem recall method includes:
determining semantic vector features corresponding to the target query problem;
obtaining a corresponding semantic vector set to be recalled according to the semantic vector features;
and determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
A2, the method of A1, the obtaining the corresponding semantic vector set to be recalled according to the semantic vector features, includes:
inputting the semantic vector features into a preset target semantic recall model to obtain the semantic vector set to be recalled.
A3, the method of A2, wherein before the inputting of the semantic vector features into the preset target semantic recall model to obtain the semantic vector set to be recalled, the method further comprises:
acquiring first training data, wherein the first training data comprises first sampling query data, first positive sample data and first negative sample data;
Training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized;
determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data;
training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
The method of A4, the method of A3, before the obtaining the first training data, further includes:
determining a common sample according to a preset question-answer data set;
acquiring a sampling network address, wherein the sampling network address is a network address with a query function;
determining a plurality of associated query information according to the sampling network address information;
obtaining a difficult sample according to each piece of associated query information;
and generating first training data according to the common sample and the difficult sample.
The method of A5, A4, wherein obtaining a difficult sample according to each associated query information includes:
determining a query information pair according to the associated query information;
Determining click information of the query information pair;
screening the query information pairs according to the click information to obtain effective query information pairs;
and generating a difficult sample according to the query information pair.
The method of A4, wherein obtaining a difficult sample according to each associated query information includes:
determining a current query text and a query result corresponding to the current query text according to the associated query information;
determining the click rate of each inquiry result;
determining an effective query result according to the click rate;
and generating a difficult sample according to the target query text and the effective query result.
A7, the method of A6, wherein the determining the effective query result according to the click rate comprises:
acquiring the number of query results corresponding to the current query text;
and determining effective query results according to the number of the query results and the click rate of the query results.
A8, training the initial semantic recall model according to the first training data to obtain a semantic model to be optimized, wherein the method comprises the following steps:
inputting the first training data into an initial semantic recall model to obtain a semantic vector representation;
calculating a loss value according to the semantic vector representation;
And adjusting the initial semantic recall model according to the loss value until the model converges to obtain the semantic model to be optimized.
A9, determining a recall problem according to the semantic vector set to be recalled to complete problem recall according to the method of A1, wherein the method comprises the following steps:
carrying out keyword analysis according to the target query problem to obtain keyword information;
obtaining an alternative recall problem according to the keyword information;
and determining a recall problem according to the semantic vector set and the alternative recall problem to complete the recall of the problem.
A10, determining a recall problem according to the semantic vector set to be recalled to complete problem recall according to the method of any one of A1 to A9, wherein the method comprises the following steps:
according to the semantic vector features and the semantic vector feature set to be recalled, matching a preset vector feature library to obtain a first to-be-recalled problem corresponding to the semantic vector features and a second to-be-recalled problem corresponding to the semantic vector feature set to be recalled;
and determining the recall problem according to the first to-be-recalled problem and the second to-be-recalled problem so as to complete the problem recall.
The application also discloses B11, a problem recall device, the problem recall device includes:
the determining module is used for determining semantic vector features corresponding to the target query problem;
The processing module is used for obtaining a corresponding semantic vector set to be recalled according to the semantic vector features;
and the processing module is also used for determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
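Read end to end, the B11 modules amount to the small pipeline below; the encoder and recall-model callables are placeholders for illustration, not components named by the application.

```python
# Minimal end-to-end sketch of the B11 device, with hypothetical callables.

def recall_questions(target_query, encoder, recall_model, id_to_question):
    query_vec = encoder(target_query)      # determining module: semantic vector features
    vector_set = recall_model(query_vec)   # processing module: vectors to be recalled
    # Resolve each recalled (question_id, vector) entry to its question text.
    return [id_to_question[qid] for qid, _vec in vector_set]
```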
B12, the device of B11, wherein the processing module is further configured to input the semantic vector features into a preset target semantic recall model to obtain the semantic vector set to be recalled.
B13, the device of B12, wherein the processing module is further configured to obtain first training data, wherein the first training data comprises first sampling query data, first positive sample data and first negative sample data;
training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized;
determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data;
training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
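The two-stage schedule of B13 only constrains the second positive-to-negative ratio to be at least the first; the sketch below shows one way to derive such second training data, with an assumed target ratio.

```python
import random

# Hypothetical derivation of second training data from the first, keeping
# the positive share at or above the first stage's, as B13 requires.

def derive_second_stage(first_stage, target_pos_ratio=0.5):
    """first_stage: list of (query, candidate, label) with label 1 or 0."""
    positives = [s for s in first_stage if s[2] == 1]
    negatives = [s for s in first_stage if s[2] == 0]
    # Keep all positives; downsample negatives until positives make up
    # at least target_pos_ratio of the data.
    keep_neg = min(len(negatives),
                   int(len(positives) * (1 - target_pos_ratio) / target_pos_ratio))
    second = positives + random.sample(negatives, keep_neg)
    random.shuffle(second)
    return second
```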
B14, the device of B13, wherein the processing module is further configured to determine a common sample according to a preset question-answer data set;
acquiring a sampling network address, wherein the sampling network address is a network address with a query function;
determining a plurality of associated query information according to the sampling network address;
obtaining a difficult sample according to each piece of associated query information;
and generating first training data according to the common sample and the difficult sample.
B15, the device of B14, wherein the processing module is further configured to determine a query information pair according to the associated query information;
determining click information of the query information pair;
screening the query information pairs according to the click information to obtain effective query information pairs;
and generating a difficult sample according to each effective query information pair.
B16, the device of B14, wherein the processing module is further configured to determine a current query text and a query result corresponding to the current query text according to the associated query information;
determining the click rate of each query result;
determining an effective query result according to the click rate;
and generating a difficult sample according to the current query text and the effective query result.
B17, the device of B16, wherein the processing module is further configured to obtain the number of query results corresponding to the current query text;
and determining effective query results according to the number of the query results and the click rate of the query results.
B18, the device of B13, wherein the processing module is further configured to input the first training data into an initial semantic recall model to obtain a semantic vector representation;
calculating a loss value according to the semantic vector representation;
and adjusting the initial semantic recall model according to the loss value until the model converges to obtain the semantic model to be optimized.

Claims (10)

1. A problem recall method, the problem recall method comprising:
determining semantic vector features corresponding to the target query problem;
obtaining a corresponding semantic vector set to be recalled according to the semantic vector features;
and determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
2. The method of claim 1, wherein the obtaining a corresponding semantic vector set to be recalled according to the semantic vector features comprises:
and inputting the semantic vector features into a preset target semantic recall model to obtain the semantic vector set to be recalled.
3. The method of claim 2, wherein before the inputting the semantic vector features into a preset target semantic recall model to obtain the semantic vector set to be recalled, the method further comprises:
acquiring first training data, wherein the first training data comprises first sampling query data, first positive sample data and first negative sample data;
training an initial semantic recall model according to the first training data to obtain a semantic model to be optimized;
determining second training data according to the first training data, wherein the second training data comprises second sampling query data, second positive sample data and second negative sample data, and the ratio of the second positive sample data to the second negative sample data is greater than or equal to the ratio of the first positive sample data to the first negative sample data;
training the semantic model to be optimized according to the second training data to obtain a target semantic recall model.
4. The method of claim 3, wherein before the acquiring first training data, the method further comprises:
determining a common sample according to a preset question-answer data set;
acquiring a sampling network address, wherein the sampling network address is a network address with a query function;
determining a plurality of associated query information according to the sampling network address;
obtaining a difficult sample according to each piece of associated query information;
and generating first training data according to the common sample and the difficult sample.
5. The method of claim 4, wherein the obtaining a difficult sample according to each piece of associated query information comprises:
determining a query information pair according to the associated query information;
determining click information of the query information pair;
screening the query information pairs according to the click information to obtain effective query information pairs;
and generating a difficult sample according to each effective query information pair.
6. The method of claim 4, wherein the obtaining a difficult sample according to each piece of associated query information comprises:
determining a current query text and a query result corresponding to the current query text according to the associated query information;
determining the click rate of each query result;
determining an effective query result according to the click rate;
and generating a difficult sample according to the current query text and the effective query result.
7. The method of claim 6, wherein the determining an effective query result according to the click rate comprises:
acquiring the number of query results corresponding to the current query text;
and determining effective query results according to the number of the query results and the click rate of the query results.
8. A problem recall device, the problem recall device comprising:
the determining module is used for determining semantic vector features corresponding to the target query problem;
the processing module is used for obtaining a corresponding semantic vector set to be recalled according to the semantic vector features;
and the processing module is also used for determining recall problems according to the semantic vector set to be recalled so as to complete problem recall.
9. A problem recall equipment, the equipment comprising: a memory, a processor and a problem recall program stored on the memory and executable on the processor, the problem recall program being configured to implement the steps of the problem recall method of any one of claims 1 to 7.
10. A storage medium having stored thereon a problem recall program which, when executed by a processor, implements the steps of the problem recall method of any one of claims 1 to 7.
CN202210057608.2A 2022-01-18 2022-01-18 Problem recall method, device, equipment and storage medium Pending CN116501831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210057608.2A CN116501831A (en) 2022-01-18 2022-01-18 Problem recall method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210057608.2A CN116501831A (en) 2022-01-18 2022-01-18 Problem recall method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116501831A true CN116501831A (en) 2023-07-28

Family

ID=87325415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210057608.2A Pending CN116501831A (en) 2022-01-18 2022-01-18 Problem recall method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116501831A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117093696A (en) * 2023-10-16 2023-11-21 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model
CN117093696B (en) * 2023-10-16 2024-02-02 浙江同花顺智能科技有限公司 Question text generation method, device, equipment and medium of large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination