CN117033612A - Text matching method, electronic equipment and storage medium


Info

Publication number
CN117033612A
CN117033612A (application CN202311048339.4A)
Authority
CN
China
Prior art keywords
text
target
preset
model
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311048339.4A
Other languages
Chinese (zh)
Other versions
CN117033612B (en)
Inventor
李斯蕊
姜炜
刘丰
张丽颖
何凯
谭智隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Travelsky Mobile Technology Co Ltd
Original Assignee
China Travelsky Mobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Travelsky Mobile Technology Co Ltd filed Critical China Travelsky Mobile Technology Co Ltd
Priority to CN202311048339.4A
Publication of CN117033612A
Application granted
Publication of CN117033612B
Legal status: Active
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text matching method, an electronic device and a storage medium, relating to the field of text matching. The method comprises: acquiring a question text A input by a target user; inputting A into a first text matching module so that each text ranking sub-model matches f texts, thereby obtaining a matched text list set H; inputting H into a text recall sub-model so that the text recall sub-model determines the absolute confidence of each matched text in H, obtaining the matched-text absolute confidence list TH corresponding to H; and, according to TH, taking each matched text that matches A as a first target text to obtain a first target text set B1. The method and device ensure the accuracy of the first target texts output for a question text.

Description

Text matching method, electronic equipment and storage medium
Technical Field
The present invention relates to the field of text matching, and in particular, to a text matching method, an electronic device, and a storage medium.
Background
With the rapid development of network technology, users' requirements for obtaining response texts to their questions keep rising: they expect both fast retrieval and accurate responses. Typically, a user obtains a response text by entering a question directly into the search box of a related application on a terminal device, and a large language model inside the application generates a response text corresponding to the question text input by the user. At present, however, the accuracy of response texts generated from user question texts is low.
Disclosure of Invention
Aiming at the technical problems, the application adopts the following technical scheme:
According to a first aspect of the present application, there is provided a text matching method, the method being applied to a preset text retrieval model, the text retrieval model comprising a preset text library belonging to a preset field and a first text matching module, the first text matching module comprising d text ranking sub-models and a text recall sub-model, a text ranking sub-model being capable of ranking each text in the text library according to the relative confidence between a question text input by a user and each text in the text library, and the text recall sub-model being capable of determining the absolute confidence between each text input to it and the question text; the method comprises the following steps:
S100, acquiring a question text A input by a target user;
S200, inputting A into the first text matching module so that each text ranking sub-model matches f texts, thereby obtaining a matched text list set H = (H_1, H_2, …, H_c, …, H_d), c = 1, 2, …, d; wherein H_c is the matched text list output by the c-th text ranking sub-model; H_c = (H_{c,1}, H_{c,2}, …, H_{c,e}, …, H_{c,f}), e = 1, 2, …, f; wherein H_{c,e} is the e-th matched text in H_c;
S300, inputting H into the text recall sub-model so that the text recall sub-model determines the absolute confidence of each matched text in H, obtaining the matched-text absolute confidence list TH = (TH_1, TH_2, …, TH_x, …, TH_y) corresponding to H, x = 1, 2, …, y; wherein TH_x is the absolute confidence of the x-th matched text in TH, and y is the number of matched-text absolute confidences in TH; y = d × f;
S400, according to TH, taking each matched text that matches A as a first target text to obtain a first target text set B1 = (B1_1, B1_2, …, B1_p, …, B1_q), p = 1, 2, …, q; wherein B1_p is the p-th first target text matching A, and q is the number of first target texts in B1; ηB1_p ≥ η0, where ηB1_p is the absolute confidence of B1_p and η0 is a preset absolute confidence threshold.
According to another aspect of the present application, there is also provided a non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the above-described text matching method.
According to another aspect of the present application, there is also provided an electronic device comprising a processor and the above-described non-transitory computer-readable storage medium.
The invention has at least the following beneficial effects:
According to the above text matching method, each text ranking sub-model in the preset text retrieval model outputs f matched texts for the question text, yielding a matched text list per ranking sub-model; the text recall sub-model then scores the absolute confidence of every matched text and takes those whose absolute confidence exceeds the preset threshold as the first target texts matching the question text. Because absolute confidence measures the degree of match between the question text and a matched text, a matched text with high absolute confidence can be judged to be a response text for the question text, which ensures the accuracy of the first target texts output for the question text.
Further, when the text ranking sub-models rank the texts in the text library against the question text, each sub-model compares the question text with the library texts along a different dimension, so the matched texts in any two matched text lists are not identical; compared with a single ranking sub-model, the matched texts output by multiple ranking sub-models improve the coverage of candidate response texts for the question text. Moreover, since each ranking sub-model ranks the library texts by relative confidence, every sub-model can output f matched texts in its own dimension. This guarantees that the matched texts fed to the text recall sub-model are numerous and diverse, further improving the accuracy of the first target texts output for the question text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a text matching method according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
A text matching method will now be described with reference to the flowchart of the method shown in Fig. 1.
The method is applied to a preset text retrieval model. The text retrieval model comprises a preset text library belonging to a preset field and a first text matching module; the first text matching module comprises d text ranking sub-models and a text recall sub-model. A text ranking sub-model can rank the texts in the text library according to the relative confidence between the question text input by a user and each text in the text library, and the text recall sub-model can determine the absolute confidence between each text input to it and the question text.
In this embodiment, the preset text retrieval model includes a text library in a preset field, that is, a corpus; for example, a text library in the civil aviation field, where every text is related to civil aviation, including announcements issued by airlines and historical response texts corresponding to historical question texts input by users. The preset text retrieval model further includes the first text matching module, which contains d text ranking sub-models, such as a BM25 model, a BERT model, or a text classification model, as well as a text recall sub-model, e.g., a SimCSE model. It can be appreciated that relative confidence characterizes the confidence of each matched text relative to the others output by a ranking sub-model and cannot characterize the true confidence between a matched text and the question text, whereas absolute confidence characterizes the true confidence between a matched text and the question text.
The text matching method comprises the following steps:
S100, acquiring a question text A input by a target user.
In this embodiment, it can be understood that A is the question text currently input by the target user; for example, "Can pets be carried on a flight?".
S200, inputting A into the first text matching module so that each text ranking sub-model matches f texts, thereby obtaining a matched text list set H = (H_1, H_2, …, H_c, …, H_d), c = 1, 2, …, d; wherein H_c is the matched text list output by the c-th text ranking sub-model; H_c = (H_{c,1}, H_{c,2}, …, H_{c,e}, …, H_{c,f}), e = 1, 2, …, f; wherein H_{c,e} is the e-th matched text in H_c.
In this embodiment, after A is input to the first text matching module, any text ranking sub-model can determine the relative confidence between each text in the text library and A, rank the library texts in descending order of that confidence, and take the first f texts as its matched texts; a matched text list output by each text ranking sub-model is thus obtained.
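To make S200 concrete, the following minimal Python sketch ranks the text library with each ranking sub-model and keeps the top f texts per model; the scorer callables are hypothetical stand-ins for BM25/BERT-style models, and all names are illustrative rather than taken from the patent.

```python
# Sketch of S200: each text ranking sub-model scores the whole library
# against question A and keeps its top-f texts (relative confidence,
# descending). The ranker callables are hypothetical stand-ins.
from typing import Callable, List

def top_f_per_model(A: str,
                    text_library: List[str],
                    rankers: List[Callable[[str, str], float]],
                    f: int) -> List[List[str]]:
    """Return H = (H_1, ..., H_d): one top-f matched-text list per ranker."""
    H = []
    for rank in rankers:
        ranked = sorted(text_library, key=lambda t: rank(A, t), reverse=True)
        H.append(ranked[:f])
    return H
```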
It should be noted that, when the text ranking sub-models rank the library texts against the question text, each sub-model compares the question text with the library texts along a different dimension, so the matched texts in any two matched text lists are not identical; and since each sub-model ranks the library texts by relative confidence, every sub-model can output f matched texts in its own dimension. The matched texts obtained are therefore numerous and diverse.
S300, inputting H into the text recall sub-model so that the text recall sub-model determines the absolute confidence of each matched text in H, obtaining the matched-text absolute confidence list TH = (TH_1, TH_2, …, TH_x, …, TH_y) corresponding to H, x = 1, 2, …, y; wherein TH_x is the absolute confidence of the x-th matched text in TH, and y is the number of matched-text absolute confidences in TH; y = d × f.
In this embodiment, the text recall sub-model can score the absolute confidence of each matched text according to the semantics of the question text and of the matched text; TH is obtained from the absolute confidence of each matched text. The text recall sub-model may be a SimCSE model.
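As a hedged illustration of S300: the patent names SimCSE for the recall sub-model, and cosine similarity between sentence embeddings is one plausible realization of its absolute-confidence scoring; `embed` below is a hypothetical stand-in for such an encoder, not an API from the patent.

```python
# Sketch of S300: flatten H (y = d * f texts) and score each matched
# text against A. Cosine similarity over embeddings is an assumption;
# `embed` stands in for a SimCSE-style sentence encoder.
import math
from typing import Callable, List

def absolute_confidences(A: str,
                         H: List[List[str]],
                         embed: Callable[[str], List[float]]) -> List[float]:
    def cosine(u: List[float], v: List[float]) -> float:
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    qv = embed(A)
    return [cosine(qv, embed(t)) for H_c in H for t in H_c]  # the list TH
```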
S400, according to TH, taking each matched text that matches A as a first target text to obtain a first target text set B1 = (B1_1, B1_2, …, B1_p, …, B1_q), p = 1, 2, …, q; wherein B1_p is the p-th first target text matching A, and q is the number of first target texts in B1; ηB1_p ≥ η0, where ηB1_p is the absolute confidence of B1_p and η0 is a preset absolute confidence threshold.
In this embodiment, a matched text whose absolute confidence is greater than the preset absolute confidence threshold can be judged to have high correlation or similarity with the standard response text corresponding to the question text; such matched texts are taken as the first target texts of the question text, yielding B1. Note that the number of first target texts in B1 is at most y, and B1 may be empty: when B1 is empty, the absolute confidence of every matched text is below the preset threshold, meaning every matched text has low correlation or similarity with the standard response text, so none can serve as a first target text. This avoids outputting matched texts whose semantics differ greatly from the standard response text corresponding to the question text.
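A minimal sketch of the S400 threshold filter follows, assuming the matched texts have been flattened into one list aligned with TH; as noted above, the result B1 may legitimately be empty.

```python
# Sketch of S400: keep matched texts whose absolute confidence reaches
# the preset threshold eta0; B1 may come out empty.
from typing import List

def first_target_texts(matched_texts: List[str],
                       TH: List[float],
                       eta0: float) -> List[str]:
    return [t for t, conf in zip(matched_texts, TH) if conf >= eta0]
```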
According to the above text matching method, each text ranking sub-model in the preset text retrieval model outputs f matched texts for the question text, yielding a matched text list per ranking sub-model; the text recall sub-model then scores the absolute confidence of every matched text and takes those whose absolute confidence exceeds the preset threshold as the first target texts matching the question text. Because absolute confidence measures the degree of match between the question text and a matched text, a matched text with high absolute confidence can be judged to be a response text for the question text, which ensures the accuracy of the first target texts output for the question text.
Further, when the text ranking sub-models rank the texts in the text library against the question text, each sub-model compares the question text with the library texts along a different dimension, so the matched texts in any two matched text lists are not identical; compared with a single ranking sub-model, the matched texts output by multiple ranking sub-models improve the coverage of candidate response texts for the question text. Moreover, since each ranking sub-model ranks the library texts by relative confidence, every sub-model can output f matched texts in its own dimension. This guarantees that the matched texts fed to the text recall sub-model are numerous and diverse, further improving the accuracy of the first target texts output for the question text.
Optionally, f is determined by the following steps:
S210, obtaining the number of first target texts in each first target text set within a preset sliding time window W, yielding a set of first-target-text counts S = (S_1, S_2, …, S_u, …, S_v), u = 1, 2, …, v; wherein S_u is the number of first target texts in the u-th first target text set within the current W, and v is the number of first target text sets within the current W; the end time of W is the current time.
In this embodiment, within the sliding time window W there are multiple historical question texts input by users, each corresponding to a first target text set, and the number of first target texts in each such set can be obtained, yielding S. The length of W is a preset value, e.g., 72 hours; its end time is the current time, which ensures that the counts in S are up to date and accurately reflect the current number of first target texts per set. The step size by which W slides may also be preset, e.g., 10 minutes, so that S is refreshed every 10 minutes, reducing the computational load.
S220, determining f according to S: f = ⌈α · (Σ_{u=1}^{v} S_u) / v⌉; wherein α is a preset proportionality coefficient, α > 1, and ⌈·⌉ is a preset round-up function.
In this embodiment, the numbers of first target texts in the sets in S are summed and divided by the number of first target text sets in W, giving the average number of first target texts per set within W; this average reflects the current level of first-target-text counts, and f is determined from it together with the preset proportionality coefficient α. It can be understood that f is dynamic: it changes with the first-target-text counts within the current W, which avoids an f so large that the subsequent text recall sub-model takes too long to score the matched texts, and an f so small that too few matched texts are produced and the first target texts matched subsequently become inaccurate.
As an example, f may also be set directly to a fixed value; in practical applications, setting f to 50 gives good results.
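Putting S210 and S220 together, a minimal sketch of the dynamic computation of f is given below; the default α and the empty-window fallback are assumptions (the patent only requires α > 1 and reports f = 50 as a good fixed choice).

```python
# Sketch of S210-S220: f = ceil(alpha * mean(S)), alpha > 1. Maintaining
# the window W itself (e.g. 72 h long, sliding every 10 min) is assumed
# to happen elsewhere; S is the list of first-target-text counts in W.
import math
from typing import List

def dynamic_f(S: List[int], alpha: float = 1.5, fallback: int = 50) -> int:
    if not S:                 # empty window: fall back to a fixed f
        return fallback
    return math.ceil(alpha * sum(S) / len(S))
```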
In an exemplary embodiment, the text retrieval model further includes a second text matching module, and the text library includes several texts with different text lengths; the method further includes the following steps:
S500, inputting A into the second text matching module of the preset text retrieval model so that the second text matching module matches several second target texts from the text library according to A, obtaining a second target text set B2 = (B2_1, B2_2, …, B2_j, …, B2_k), j = 1, 2, …, k; wherein B2_j is the j-th second target text matched from the text library according to A, and k is the number of second target texts in B2; the text length of B2_j is greater than a preset text length threshold, and the text length of B1_p is less than or equal to the preset text length threshold.
In this embodiment, the preset text retrieval model further includes a second text matching module, which can be understood as a long-text matching module, e.g., a distributed search engine (Elasticsearch). The second text matching module can retrieve, from the long texts contained in the text library, those matching the question text input by the user. It can be understood that the long texts retrieved by the second text matching module are fixed texts related to the preset field, e.g., announcements issued by airlines; when the user's question concerns an airline announcement, the texts matched by the first text matching module may not contain the announcement-related long texts, in which case using the first text matching module alone may fail to cover the content related to the question text. The second text matching module is therefore provided to avoid this problem.
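For illustration, an S500-style long-text query against Elasticsearch might look like the following; the index name, field names, and the stored-length filter are assumptions, not details from the patent.

```python
# Hypothetical sketch of S500 with the elasticsearch Python client:
# retrieve texts matching A whose stored length exceeds the threshold.
from elasticsearch import Elasticsearch

def match_long_texts(A: str, length_threshold: int, k: int = 10):
    es = Elasticsearch("http://localhost:9200")
    resp = es.search(
        index="preset_text_library",       # illustrative index name
        query={
            "bool": {
                "must": [{"match": {"content": A}}],
                "filter": [{"range": {"text_length": {"gt": length_threshold}}}],
            }
        },
        size=k,
    )
    return [hit["_source"]["content"] for hit in resp["hits"]["hits"]]
```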
S510, adding each first target text in B1 and each second target text in B2 into a preset list TA' to obtain the text list TA matched with A; wherein the initial state of TA' is empty.
In this embodiment, TA contains both long and short texts, so regardless of the type of question text the user inputs, TA can fully cover the content of the corresponding response text, improving the accuracy of response generation.
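S510 then reduces to a simple merge; a minimal sketch:

```python
# Sketch of S510: TA' starts empty and receives the short first target
# texts (B1) and the long second target texts (B2).
from typing import List

def build_TA(B1: List[str], B2: List[str]) -> List[str]:
    TA = []          # preset list TA', initially empty
    TA.extend(B1)    # short texts from the first text matching module
    TA.extend(B2)    # long texts from the second text matching module
    return TA
```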
Optionally, after step S510, the method further includes the following steps:
S520, splicing A and TA according to a preset text splicing template to generate a target question text QA corresponding to A.
In this embodiment, after TA is obtained, the question text input by the user and the texts matched by the text retrieval model are spliced via a preset text splicing template into the target question text corresponding to the question text; it can be understood that the target question text includes the question text input by the user and several texts in the preset field matched by the text retrieval model according to the question text.
The preset splicing template may be a preset JSON-format structure with different fields corresponding to A and TA; alternatively, a preset text template may be adopted, with A and TA filled into the corresponding positions to form a semantically meaningful text.
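A minimal sketch of the JSON-structure variant of the splicing template; the field names are illustrative assumptions, not keys specified by the patent.

```python
# Sketch of S520 in JSON form: distinct fields carry the question text A
# and the matched text list TA.
import json
from typing import List

def splice_json(A: str, TA: List[str]) -> str:
    QA = {"question": A, "context": TA}    # field names assumed
    return json.dumps(QA, ensure_ascii=False)

# e.g. splice_json("Can pets be carried on a flight?", ["...airline notice..."])
```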
Compared with the question text alone, the spliced target question text includes several in-field texts matched to the question text, so the subsequent target text generation model can generate the response text by combining those texts; if the question text alone were input, the target text generation model might draw on related texts from all fields, producing a response text that is overly broad or simply wrong. Splicing the user's question text with the texts matched by the text retrieval model therefore improves the accuracy of subsequent response generation.
S530, inputting QA into a preset target text generation model so that the target text generation model outputs a response text matched with A according to QA.
In this embodiment, the preset target text generation model is obtained by training an original natural language processing model, and the training samples used, i.e., the corpus, also belong to the preset field; this improves the target text generation model's ability to handle question texts in the preset field, so that the response texts it outputs are more vertical (more domain-specific) to the preset field, improving their accuracy.
For a question text input by a user: first, the question text is input into the preset text retrieval model, which includes a preset text library whose texts all belong to the preset field, so the texts matched by the retrieval model also belong to the preset field; next, the question text and the matched in-field texts are spliced according to the preset splicing template to obtain the target question text, which includes both the user's question text and the matched in-field texts; finally, the target question text is input into the preset target text generation model, which outputs a response text matching the user's question text. Because the target question text carries both the question text and the matched in-field texts, the target text generation model can combine the latter when generating the response text, which improves the verticality (domain fit) between the generated response text and its question text and thus the accuracy of response generation.
Optionally, the preset target text generation model is obtained through the following steps:
S531, acquiring a preset first text generation model; wherein the first text generation model is obtained by training an initial text generation model with a preset first training sample set, the first training sample set comprising training samples from a plurality of fields.
In this embodiment, the preset first text generation model is obtained by training an initial natural language processing model, and an existing natural language processing model may be selected; it can be understood that the first text generation model is a general-purpose model serving users in all fields, its training samples come from various fields, and the response texts it generates for user question texts are relatively broad in content.
S532, acquiring a preset second training sample set; wherein the second training sample set comprises several training samples belonging to a target field, and the target field is one of the plurality of fields.
In this embodiment, since the first text generation model obtained in step S531 is a general-purpose natural language processing model, its ability to generate response texts in a vertical domain is poor, and the generated response text may not match the domain of the question text.
In view of the above, in this embodiment the first text generation model is pre-trained, and the second training sample set used for the pre-training is obtained from the target field; for example, the training samples, i.e., the corpus, used for pre-training are obtained from the civil aviation field, and their content is related to civil aviation.
S533, training the first text generation model with the second training sample set to obtain the target text generation model.
It can be appreciated that each training sample in the second training sample set is related to the target domain, so the first text generation model learns a large amount of in-domain knowledge; the target question text input to the model is semantically meaningful text, equivalent to an instruction. Thus, when handling a question text in the target domain, the model can both understand the semantics of the target question text and draw on the in-domain knowledge it has learned, greatly improving the accuracy of the generated response text.
Optionally, the first text generation model includes a pre-training module and a fine tuning module, an output end of the pre-training module is connected to an input end of the fine tuning module, the pre-training module is used for pre-training the first text generation model, the fine tuning module is used for adjusting a text output by the pre-training module, and the step S533 includes the following steps:
S5331, acquiring a second training sample set; the second training sample set includes a second pre-training sample set corresponding to the pre-training module and a second fine tuning sample set corresponding to the fine tuning module.
S5332, inputting the second pre-training sample set to the pre-training module, inputting the second fine-tuning sample set to the fine-tuning module, and training the first text generation model to obtain the target text generation model.
In this embodiment, the first text generation model consists of two parts: the pre-training module and the fine-tuning module. When training the first text generation model, a second pre-training sample set and a second fine-tuning sample set need to be preset so that the two modules can be trained respectively, giving the first text generation model a good training effect.
In an exemplary embodiment, a method for generating a response text is provided, the method including the following steps:
s610, acquiring a question text A input by a target user.
In this embodiment, it can be understood that A is the question text currently input by the target user; for example, "Can pets be carried on a flight?".
S620, inputting A into a preset text retrieval model so that the text retrieval model outputs, according to A, a text list TA = (TA_1, TA_2, …, TA_n, …, TA_m) matched with A, n = 1, 2, …, m; wherein TA_n is the n-th text matched with A, and m is the number of texts matched with A; the text retrieval model comprises a preset text library and a text matching module, the text matching module being capable of matching several texts with A from the text library according to A; each text in the text library belongs to a preset field.
In this embodiment, the preset text retrieval model includes a preset text library, i.e., a corpus, belonging to a preset field; for example, a text library in the civil aviation field, where every text is related to civil aviation, including announcements issued by airlines and historical response texts corresponding to historical question texts input by users. According to the question text input by the user, the preset text retrieval model can retrieve several matching texts from the preset text library, yielding TA.
It should be noted that, for any question input by the user, the preset text retrieval model can match texts to the question text, and the matched texts belong to the preset field; restricting the text library to the preset field ensures that the texts matched according to the question text are actually related to it and avoids matching irrelevant texts from other fields, thereby improving the accuracy of subsequent response generation.
S630, splicing A and TA according to a preset text splicing template to generate a target question text QA corresponding to A; wherein QA includes A and the matched texts in TA.
In this embodiment, after TA is obtained, the question text input by the user and the texts matched by the text retrieval model are spliced via a preset text splicing template into the target question text corresponding to the question text; it can be understood that the target question text includes the question text input by the user and several texts in the preset field matched by the text retrieval model according to the question text.
The preset splicing template may be a preset JSON-format structure with different fields corresponding to A and TA; alternatively, a preset text template may be adopted, with A and TA filled into the corresponding positions to form a semantically meaningful text.
Compared with the question text alone, the spliced target question text includes several in-field texts matched to the question text, so the subsequent target text generation model can generate the response text by combining those texts; if the question text alone were input, the target text generation model might draw on related texts from all fields, producing a response text that is overly broad or simply wrong. Splicing the user's question text with the texts matched by the text retrieval model therefore improves the accuracy of subsequent response generation.
S640, inputting QA into a preset target text generation model, so that the target text generation model outputs a response text matched with A according to QA.
In this embodiment, the preset target text generation model is obtained by training an original natural language processing model, and the training samples used, i.e., the corpus, also belong to the preset field; this improves the target text generation model's ability to handle question texts in the preset field, so that the response texts it outputs are more vertical (more domain-specific) to the preset field, improving their accuracy.
For a question text input by a user: first, the question text is input into the preset text retrieval model, which includes a preset text library whose texts all belong to the preset field, so the texts matched by the retrieval model also belong to the preset field; next, the question text and the matched in-field texts are spliced according to the preset splicing template to obtain the target question text, which includes both the user's question text and the matched in-field texts; finally, the target question text is input into the preset target text generation model, which outputs a response text matching the user's question text. Because the target question text carries both the question text and the matched in-field texts, the target text generation model can combine the latter when generating the response text, which improves the verticality (domain fit) between the generated response text and its question text and thus the accuracy of response generation.
Optionally, step S630 includes the following steps:
S631, acquiring a preset text splicing template; wherein the text splicing template comprises a preset first text segment T_1, second text segment T_2 and third text segment T_3 arranged in sequence, with a preset first text space W_1 between T_1 and T_2 and a preset second text space W_2 between T_2 and T_3.
In this embodiment, the first text segment T_1, the second text segment T_2 and the third text segment T_3 are preset text segments carrying certain semantics; specifically, T_1 is "Please combine the following text:", T_2 is "help me find:", and T_3 is "the answer.". The first text space between T_1 and T_2 is for filling in the text list, and the second text space between T_2 and T_3 is for filling in the question text input by the user; T_1 precedes T_2 and T_2 precedes T_3, thereby ensuring that the finally generated target question text is semantically correct.
In this embodiment, W_1 and W_2 can be determined by the following steps:
S6311, obtaining the text length of each text in TA to obtain the total text length QTA of the texts in TA.
S6312, determining from QTA the text length that W_1 can accommodate: ⌈β_1 × QTA⌉; wherein β_1 is a preset first proportionality coefficient, β_1 > 1.
S6313, obtaining the text length QA of A.
S6314, determining from QA the text length that W_2 can accommodate: ⌈β_2 × QA⌉; wherein β_2 is a preset second proportionality coefficient, β_2 > 1, and ⌈·⌉ is a preset round-up function.
With W_1 and W_2 determined in this way, their text lengths are set dynamically from the lengths occupied by A and the texts in TA together with β_1 and β_2; this avoids setting W_1 and W_2 so large that the generated target question text becomes overly long, and so small that A and/or TA cannot be fully added. β_1 > 1 ensures every text in TA can be completely added to W_1, and β_2 > 1 ensures A can be completely added to W_2; β_1 ranges from 1.05 to 1.1, and β_2 from 1.1 to 1.3.
S632, adding TA to W_1 and A to W_2 to generate QA.
In this embodiment, QA is a semantically meaningful text, and the target text generation model can generate the response text corresponding to A according to the specific semantics of QA, i.e., it can combine the texts in TA when generating the response text of A; this improves the accuracy of response generation.
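The sketch below combines S6311-S6314 with S632: it computes the capacities of the two text spaces from β_1 and β_2 and fills the three-segment template; the segment wording follows the translated example above, and the β defaults are assumptions within the stated ranges.

```python
# Sketch of S6311-S6314 + S632: size the two text spaces and fill the
# T_1 / W_1 / T_2 / W_2 / T_3 template. Defaults are assumptions.
import math
from typing import List

T1 = "Please combine the following text: "
T2 = "help me find: "
T3 = "the answer."

def splice_template(A: str, TA: List[str],
                    beta1: float = 1.05, beta2: float = 1.2) -> str:
    QTA = sum(len(t) for t in TA)          # total length of texts in TA
    w1_cap = math.ceil(beta1 * QTA)        # W_1 capacity (beta1 > 1)
    w2_cap = math.ceil(beta2 * len(A))     # W_2 capacity (beta2 > 1)

    ta_text = " ".join(TA)[:w1_cap]        # fill W_1 with the matched texts
    a_text = A[:w2_cap]                    # fill W_2 with the question text
    return f"{T1}{ta_text} {T2}{a_text} {T3}"
```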
Optionally, step S630 may instead include the following steps:
S633, acquiring a preset text splicing template; wherein the text splicing template comprises a preset first character string, a preset second character string, a third text space W_3 associated with the first character string, and a fourth text space W_4 associated with the second character string; the first character string is different from the second character string.
In this embodiment, the preset text splicing template is a structure in a preset format, e.g., a JSON-format structure; the first character string represents the question text A input by the user, and the second character string represents the text list TA matched with A. When the target text generation model is trained, the meanings of the first and second character strings are annotated so that the model can recognize what each represents. The first and second character strings may be short strings or single characters, e.g., the first character string may be "query:" or "q:"; shorter character strings save space for adding more text, so the target question text carries more information and the target text generation model has richer information to combine.
S634, adding A to W_3 and TA to W_4 to generate QA.
In this embodiment, QA includes A and each text in TA, and the target text generation model can identify A and the TA texts within QA, so it can combine the texts in TA to generate the response text of A; this improves the accuracy of response generation.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium, which may be disposed in an electronic device to store at least one instruction or at least one program for implementing one of the method embodiments; the at least one instruction or the at least one program is loaded and executed by the processor to implement the methods provided by the embodiments described above.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Embodiments of the present application also provide an electronic device comprising a processor and the aforementioned non-transitory computer-readable storage medium.
An electronic device according to this embodiment of the application is described below. The electronic device is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present application.
The electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: the at least one processor, the at least one memory, and a bus connecting the various system components, including the memory and the processor.
Wherein the memory stores program code that is executable by the processor to cause the processor to perform steps according to various exemplary embodiments of the application described in the "exemplary methods" section of this specification.
The storage may include readable media in the form of volatile storage, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).
The storage may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus may be one or more of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Embodiments of the present invention also provide a computer program product comprising program code for causing an electronic device to carry out the steps of the method according to the various exemplary embodiments of the invention as described in the specification, when said program product is run on the electronic device.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A text matching method, characterized in that the method is applied to a preset text retrieval model, the text retrieval model comprising a preset text library belonging to a preset field and a first text matching module, the first text matching module comprising d text ranking sub-models and a text recall sub-model, a text ranking sub-model being capable of ranking each text in the text library according to the relative confidence between a question text input by a user and each text in the text library, and the text recall sub-model being capable of determining the absolute confidence between each text input to it and the question text; the method comprises the following steps:
S100, acquiring a question text A input by a target user;
S200, inputting A into the first text matching module so that each text ranking sub-model matches f texts, thereby obtaining a matched text list set H = (H_1, H_2, …, H_c, …, H_d), c = 1, 2, …, d; wherein H_c is the matched text list output by the c-th text ranking sub-model; H_c = (H_{c,1}, H_{c,2}, …, H_{c,e}, …, H_{c,f}), e = 1, 2, …, f; wherein H_{c,e} is the e-th matched text in H_c;
S300, inputting H into the text recall sub-model so that the text recall sub-model determines the absolute confidence of each matched text in H, obtaining the matched-text absolute confidence list TH = (TH_1, TH_2, …, TH_x, …, TH_y) corresponding to H, x = 1, 2, …, y; wherein TH_x is the absolute confidence of the x-th matched text in TH, and y is the number of matched-text absolute confidences in TH; y = d × f;
S400, according to TH, taking each matched text that matches A as a first target text to obtain a first target text set B1 = (B1_1, B1_2, …, B1_p, …, B1_q), p = 1, 2, …, q; wherein B1_p is the p-th first target text matching A, and q is the number of first target texts in B1; ηB1_p ≥ η0, where ηB1_p is the absolute confidence of B1_p and η0 is a preset absolute confidence threshold.
2. The text matching method according to claim 1, wherein f is determined by:
S210, obtaining the number of first target texts in each first target text set within a preset sliding time window W, yielding a set of first-target-text counts S = (S_1, S_2, …, S_u, …, S_v), u = 1, 2, …, v; wherein S_u is the number of first target texts in the u-th first target text set within the current W, and v is the number of first target text sets within the current W; the ending time of W is the current time;
S220, determining f according to S: f = ⌈α · (Σ_{u=1}^{v} S_u) / v⌉; wherein α is a preset proportionality coefficient, α > 1, and ⌈·⌉ is a preset round-up function.
3. The text matching method according to claim 1, wherein the text ranking sub-models comprise a BM25 model and a BERT model, and the text recall sub-model comprises a SimCSE model.
4. The text matching method according to claim 1, wherein the text retrieval model further comprises a second text matching module, and the text library comprises several texts with different text lengths; the method further comprises the following steps:
S500, inputting A into the second text matching module of the text retrieval model so that the second text matching module matches several second target texts from the text library according to A, obtaining a second target text set B2 = (B2_1, B2_2, …, B2_j, …, B2_k), j = 1, 2, …, k; wherein B2_j is the j-th second target text matched from the text library according to A, and k is the number of second target texts in B2; the text length of B2_j is greater than a preset text length threshold, and the text length of B1_p is less than or equal to the preset text length threshold;
S510, adding each first target text in B1 and each second target text in B2 into a preset list TA' to obtain the text list TA matched with A; wherein the initial state of TA' is empty.
5. The text matching method according to claim 4, characterized in that after step S510, the method further comprises the steps of:
S520, splicing A and TA according to a preset text splicing template to generate a target question text QA corresponding to A; wherein QA includes A and the matched texts in TA;
s530, inputting QA into a preset target text generation model so that the target text generation model outputs a response text matched with A according to QA.
6. The text matching method according to claim 5, wherein the target text generation model is obtained by:
S531, acquiring a preset first text generation model; the first text generation model is obtained by training an initial text generation model on a preset first training sample set, wherein the first training sample set comprises training samples from a plurality of fields;
S532, acquiring a preset second training sample set; the second training sample set comprises a plurality of training samples belonging to a target field, wherein the target field is one of the plurality of fields;
S533, training the first text generation model with the second training sample set to obtain the target text generation model.
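Steps S531-S533 describe the standard general-then-domain adaptation recipe: take a model already trained on multi-domain samples and continue training it on target-domain samples only. A schematic PyTorch sketch; the tiny placeholder network and random tensors stand in for the real generation model and the second training sample set:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# S531: placeholder for the first text generation model, which the
# patent obtains beforehand by training on multi-domain samples.
first_model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))

# S532: placeholder target-domain sample set (the second training set).
inputs, targets = torch.randn(64, 16), torch.randn(64, 16)
target_domain = DataLoader(TensorDataset(inputs, targets), batch_size=8)

# S533: continue training the first model on target-domain data only,
# yielding the target text generation model.
optimizer = torch.optim.AdamW(first_model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
for epoch in range(3):
    for xb, yb in target_domain:
        optimizer.zero_grad()
        loss_fn(first_model(xb), yb).backward()
        optimizer.step()

target_model = first_model  # fine-tuned on the target field
```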
7. The text matching method according to claim 6, wherein the first text generation model includes a pre-training module and a fine-tuning module, an output end of the pre-training module being connected to an input end of the fine-tuning module; the pre-training module is used for pre-training the first text generation model, and the fine-tuning module is used for adjusting the text output by the pre-training module; the step S533 includes the following steps:
S5331, acquiring the second training sample set; the second training sample set comprises a second pre-training sample set corresponding to the pre-training module and a second fine-tuning sample set corresponding to the fine-tuning module;
S5332, inputting the second pre-training sample set to the pre-training module and the second fine-tuning sample set to the fine-tuning module, and training the first text generation model to obtain the target text generation model.
8. The text matching method according to claim 1, wherein f has a value in the range of 40 to 60.
9. A non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the text matching method of any of claims 1-8.
10. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 9.
CN202311048339.4A 2023-08-18 2023-08-18 Text matching method, electronic equipment and storage medium Active CN117033612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311048339.4A CN117033612B (en) 2023-08-18 2023-08-18 Text matching method, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117033612A true CN117033612A (en) 2023-11-10
CN117033612B CN117033612B (en) 2024-06-04

Family

ID=88627859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311048339.4A Active CN117033612B (en) 2023-08-18 2023-08-18 Text matching method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117033612B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030217052A1 (en) * 2000-08-24 2003-11-20 Celebros Ltd. Search engine method and apparatus
CN109597987A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 A kind of text restoring method, device and electronic equipment
CN110503958A (en) * 2019-08-30 2019-11-26 厦门快商通科技股份有限公司 Audio recognition method, system, mobile terminal and storage medium
CN112163405A (en) * 2020-09-08 2021-01-01 北京百度网讯科技有限公司 Question generation method and device
CN112508011A (en) * 2020-12-02 2021-03-16 上海逸舟信息科技有限公司 OCR (optical character recognition) method and device based on neural network
CN115146021A (en) * 2021-03-30 2022-10-04 北京三快在线科技有限公司 Training method and device for text retrieval matching model, electronic equipment and medium
CN113159187A (en) * 2021-04-23 2021-07-23 北京金山数字娱乐科技有限公司 Classification model training method and device, and target text determining method and device
WO2023045184A1 (en) * 2021-09-26 2023-03-30 平安科技(深圳)有限公司 Text category recognition method and apparatus, computer device, and medium
CN114971730A (en) * 2022-06-02 2022-08-30 广州欢聚时代信息科技有限公司 Method for extracting file material, device, equipment, medium and product thereof
CN116401345A (en) * 2023-03-09 2023-07-07 北京海致星图科技有限公司 Intelligent question-answering method, device, storage medium and equipment
CN116383366A (en) * 2023-06-06 2023-07-04 中航信移动科技有限公司 Response information determining method, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Zibin et al.: "Chinese Text Error Correction Model for the Security Domain Based on Knowledge Graph and BERT", 《计算机应用》 (Journal of Computer Applications), 30 June 2023 (2023-06-30), pages 1-10 *

Also Published As

Publication number Publication date
CN117033612B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
US11238845B2 (en) Multi-dialect and multilingual speech recognition
CN107491547B (en) Search method and device based on artificial intelligence
US10733197B2 (en) Method and apparatus for providing information based on artificial intelligence
CN113495900B (en) Method and device for obtaining structured query language statement based on natural language
CN109376222B (en) Question-answer matching degree calculation method, question-answer automatic matching method and device
CN111428010B (en) Man-machine intelligent question-answering method and device
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
US11580299B2 (en) Corpus cleaning method and corpus entry system
CN110334209B (en) Text classification method, device, medium and electronic equipment
CN111611452B (en) Method, system, equipment and storage medium for identifying ambiguity of search text
CN111753167B (en) Search processing method, device, computer equipment and medium
CN114840671A (en) Dialogue generation method, model training method, device, equipment and medium
US11604929B2 (en) Guided text generation for task-oriented dialogue
US10592514B2 (en) Location-sensitive ranking for search and related techniques
CN111611355A (en) Dialog reply method, device, server and storage medium
CN109977292B (en) Search method, search device, computing equipment and computer-readable storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN111339424A (en) Method, device and equipment for searching based on keywords and storage medium
CN116757224A (en) Intent understanding method, apparatus, device, and medium
CN111445271A (en) Model generation method, and prediction method, system, device and medium for cheating hotel
CN117033612B (en) Text matching method, electronic equipment and storage medium
CN117033613B (en) Response text generation method, electronic equipment and storage medium
CN111368036B (en) Method and device for searching information
CN114254634A (en) Multimedia data mining method, device, storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant