CN113407814B - Text searching method and device, readable medium and electronic equipment

Info

Publication number
CN113407814B
Authority
CN
China
Prior art keywords
text
target
training sample
vector
similarity
Prior art date
Legal status
Active
Application number
CN202110726639.8A
Other languages
Chinese (zh)
Other versions
CN113407814A (en)
Inventor
王鑫宇
张永华
Current Assignee
Douyin Vision Co Ltd
Original Assignee
Douyin Vision Co Ltd
Priority date
Filing date
Publication date
Application filed by Douyin Vision Co Ltd filed Critical Douyin Vision Co Ltd
Priority to CN202110726639.8A priority Critical patent/CN113407814B/en
Publication of CN113407814A publication Critical patent/CN113407814A/en
Priority to PCT/CN2022/090994 priority patent/WO2023273598A1/en
Application granted granted Critical
Publication of CN113407814B publication Critical patent/CN113407814B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a text searching method and apparatus, a readable medium, and an electronic device. The method includes the following steps: dividing a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; inputting the plurality of groups of target divided text sets into a pre-trained text vector model to obtain a first text vector; obtaining a target text vector from the second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base includes one or more second text vectors and the second texts corresponding to the second text vectors; and taking the second text corresponding to the target text vector as a target search result and displaying the target search result. In this way, the accuracy of text searching can be improved, incorrect or incomplete search results caused by a user's misspelling can be avoided, and the user experience is improved.

Description

Text searching method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to a text searching method, apparatus, readable medium and electronic device.
Background
With the explosive growth of Internet content, how to find a required text, such as an article, lyrics, or a web page, in massive network information has become a focus of information processing technology. Through text searching, a search engine can search according to the text to be searched input by a user and obtain search results matching that text. In the related art, text searching is generally performed based on an inverted index. However, in certain scenarios the inverted-index approach struggles to accurately match the search results a user expects, so incorrect or incomplete search results may occur, degrading the user experience.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a text search method, the method comprising:
dividing a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets;
inputting the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector;
obtaining a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
and taking the second text corresponding to the target text vector as a target search result, and displaying the target search result.
In a second aspect, the present disclosure provides a text search apparatus, the apparatus comprising:
the first text dividing module is used for dividing the first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets;
the first text vector acquisition module is used for inputting the multiple groups of target division text sets into a pre-trained text vector model to obtain a first text vector;
the target text vector acquisition module is used for acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
and the target text searching module is used for taking the second text corresponding to the target text vector as a target search result and displaying the target search result.
In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of the first aspect of the disclosure.
With the above technical solution, the first text to be searched is divided according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; the plurality of groups of target divided text sets are input into a pre-trained text vector model to obtain a first text vector; a target text vector is obtained from the second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base includes one or more second text vectors and the second texts corresponding to the second text vectors; and the second text corresponding to the target text vector is taken as the target search result and displayed. The multiple groups of target divided text sets effectively tolerate a user's misspelling and reflect the user's true expectation; searching through the pre-trained text vector model then improves the accuracy of text searching, avoids incorrect or incomplete search results caused by misspelling, and improves the user experience.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart illustrating a text search method according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a text vector model, shown in accordance with an exemplary embodiment;
FIG. 3 is a flowchart of a training method for a text vector model, according to an exemplary embodiment;
FIG. 4 is a block diagram of a text search device shown in accordance with an exemplary embodiment;
FIG. 5 is a block diagram of a training apparatus for a text vector model, shown in accordance with an exemplary embodiment;
FIG. 6 is a block diagram of another training apparatus for text vector models, shown according to an exemplary embodiment;
FIG. 7 is a block diagram of another text search device shown in accordance with an exemplary embodiment;
Fig. 8 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one", "a plurality" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one or more" is intended to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
First, an application scenario of the present disclosure will be described. The present disclosure may be applied to text search scenarios, such as searching for articles, lyrics, web pages, and the like. The inverted-index approach adopted for text searching in the related art exactly matches, in a database, documents containing the text to be searched input by the user, sorts them by the frequency of the text to be searched within each document, and returns the matched documents as search results. However, when the text to be searched deviates from the words in the database, for example because the user inadvertently misspells the text to be searched or uses nonstandard wording, the inverted-index approach struggles to accurately match the search results the user expects, which may lead to incorrect or incomplete search results and reduce the user experience.
In order to solve the problems, the disclosure provides a text searching method, a device, a readable medium and an electronic device, wherein a first text to be searched is divided according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets, and the plurality of groups of target divided text sets can effectively avoid the situation of misspelling of a user and reflect the true expectation of the user; and then searching through a pre-trained text vector model, so that the accuracy of text searching can be improved, the occurrence of incorrect or incomplete search results caused by misspelling of a user is avoided, and the user experience is improved.
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a text search method according to an exemplary embodiment. As shown in Fig. 1, the method includes the following steps.
Step 101: divide the first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets.
The first text to be searched may be text input by a user. The preset dividing mode may be any of sentence division, word division, and character division; through these preset dividing modes, a text sentence set, a text word set, and a text character set corresponding to the first text can be obtained, and any of these sets can serve as a target divided text set.
It should be noted that the multiple groups of target divided text sets obtained in this step can effectively tolerate a user's misspelling and reveal the user's true expectation. For example, suppose the text the user intends to search for is "newyork in september", but a typo at input time turns the first text to be searched into "newyork in septmber". If the first text were processed only by word segmentation, the result would be "newyork, in, septmber" or "newyork, in, sept, m, ber", and exact matching through an inverted index could not find the corresponding text. With the approach of this embodiment, however, sentence division, word division, and character division may all be performed on the first text "newyork in septmber": sentence division keeps the whole sentence "newyork in septmber" as the text sentence set corresponding to the first text; word division yields "newyork, in, septmber" or "newyork, in, sept, m, ber" as the text word set; and character division splits the first text into the individual letters "n, e, w, y, o, r, k, i, n, s, e, p, t, m, b, e, r" as the text character set. In this way, although the first text input by the user contains a misspelling, the divided groups of target divided text sets, under the combined effect of sentences, words, and characters (especially the characters), still retain a degree of similarity to the result the user expects, so the user's real expectation can surface in the search, as the sketch below illustrates.
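To make the three division modes concrete, the following minimal Python sketch reproduces the example above; the function name and the whitespace-based word split are illustrative assumptions rather than details fixed by the disclosure:

```python
def divide_text(first_text: str):
    """Divide a first text into sentence, word, and character sets.

    A minimal sketch: sentence division keeps the whole query, word
    division splits on whitespace (English-style), and character
    division splits into individual non-space characters.
    """
    text_sentence_set = [first_text]                             # sentence division
    text_word_set = first_text.split()                           # word division
    text_char_set = [c for c in first_text if not c.isspace()]   # character division
    return text_sentence_set, text_word_set, text_char_set

# Even with the misspelling, the character set stays close to the intended query.
sentences, words, chars = divide_text("newyork in septmber")
print(sentences)  # ['newyork in septmber']
print(words)      # ['newyork', 'in', 'septmber']
print(chars)      # ['n', 'e', 'w', 'y', 'o', 'r', 'k', 'i', 'n', 's', 'e', 'p', 't', 'm', 'b', 'e', 'r']
```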
Step 102: input the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector.
Continuing the example above, the text sentence set, text word set, and text character set corresponding to the first text may be input into the pre-trained text vector model, which encodes the three sets to obtain the first text vector.
Step 103: obtain a target text vector from the second text vectors of a pre-established text knowledge base according to the first text vector.
The text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors.
In this step, the target text vector may be obtained according to the similarity between the first text vector and the second text vectors. For example, any second text vector whose similarity to the first text vector is greater than or equal to a first preset similarity threshold may be taken as a target text vector; alternatively, the second text vectors may be sorted by their similarity to the first text vector in descending order, and a first preset number of the top-ranked second text vectors taken as target text vectors, as sketched below.
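The two selection strategies just described can be sketched as follows; cosine similarity and the NumPy-based implementation are illustrative assumptions, not details fixed by the disclosure:

```python
import numpy as np

def select_targets(first_vec, second_vecs, threshold=None, top_k=None):
    """Pick target text vectors by similarity to the first text vector.

    Returns indices into second_vecs: either every vector meeting a
    similarity threshold, or the top_k most similar vectors.
    """
    a = first_vec / np.linalg.norm(first_vec)
    b = second_vecs / np.linalg.norm(second_vecs, axis=1, keepdims=True)
    sims = b @ a  # cosine similarity of every second text vector to the query
    if threshold is not None:
        return np.where(sims >= threshold)[0]  # threshold strategy
    return np.argsort(-sims)[:top_k]           # top-k strategy

second_vecs = np.random.rand(1000, 128).astype(np.float32)
first_vec = np.random.rand(128).astype(np.float32)
print(select_targets(first_vec, second_vecs, top_k=5))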
Step 104: take the second text corresponding to the target text vector as the target search result, and display the target search result.
With this method, the first text to be searched is divided according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; the plurality of groups of target divided text sets are input into a pre-trained text vector model to obtain a first text vector; a target text vector is obtained from the second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base includes one or more second text vectors and the second texts corresponding to the second text vectors; and the second text corresponding to the target text vector is taken as the target search result and displayed. The multiple groups of target divided text sets effectively tolerate a user's misspelling and reflect the user's true expectation; searching through the pre-trained text vector model then improves the accuracy of text searching, avoids incorrect or incomplete search results caused by misspelling, and improves the user experience.
In another embodiment of the present disclosure, considering that the first text a user inputs when searching is typically a few keywords or key phrases rather than continuous long text, sentence division of the first text contributes little. The preset dividing modes may therefore include word division and character division, and the multiple target divided text sets may include a first target divided text set and a second target divided text set. In this case, step 101 of dividing the first text to be searched according to the plurality of preset dividing modes to obtain the plurality of groups of target divided text sets may include the following two steps:
step one, word division is carried out on a first text to obtain a first target division text set containing one or more target words.
Word division may be implemented as tokenization, and each target word obtained in this way may be called a token; a target word may be any one or more of an English word, a Chinese word, a number, and a punctuation mark.
In this step, the above-described first text may be segmented using various methods, for example, a dictionary-based segmentation method, a statistical-based machine learning segmentation method, an understanding-based segmentation method, and the like.
Further, the language of the first text may be determined first, and the first text divided using the word division mode corresponding to that language. For example, if the language of the first text is Chinese, a dictionary-based word segmentation method, a statistics-based machine learning segmentation method, or a combination of the two may be used to improve segmentation accuracy. If the language of the first text is English, a simple segmentation based on spaces and punctuation may suffice, although dictionary-based or statistics-based machine learning segmentation can also be applied to English. Through such word division, the first text can be split into target words in units of English words and/or Chinese words, so that the content of the text and the meaning it intends to express can be analyzed more precisely.
For example, if the first text is "what do you mean", the following four target words may be obtained through word segmentation: what, do, you, mean, the four target terms may be combined into the first target divided text set described above. If the first text is "I are Chinese", the following three target words can be obtained through word division: i, yes, chinese, these three target words can also be combined into the first target divided text set described above.
And secondly, performing character division on the first text to obtain a second target division text set containing one or more target characters.
Wherein the target character is any one of letters, chinese characters, numbers and punctuation marks.
For example, if the first text is "what do you mean? By character division, the following fourteen target characters can be obtained: w, h, a, t, d, o, y, o, u, m, e, a, n,? These fourteen target characters may be combined into the second target divided text set described above. If the first text is "I are 1 Chinese", the following seven target characters can be obtained through character division: i, yes, 1, one, middle, country, person, and likewise, the seven target characters may be combined into the second target divided text set described above.
It should be noted that step one and step two may be performed serially in either order, or in parallel; parallel execution can improve the efficiency of dividing the first text.
Thus, the first target divided text set and the second target divided text set can be input into the pre-trained text vector model to obtain the first text vector. Since the first text is divided in units of both words and characters, a more accurate first text vector can be obtained.
Further, the text vector model may include a word encoding network and a character encoding network, and may be configured to:
encode the first target divided text set through the word encoding network to obtain a word vector; encode the second target divided text set through the character encoding network to obtain a character vector; and compute the first text vector from the word vector and the character vector.
The word encoding network and the character encoding network may be machine learning models, for example one or more of a pre-trained convolutional neural network (CNN) or long short-term memory (LSTM) network.
Fig. 2 is a schematic diagram of a text vector model according to an exemplary embodiment, and as shown in fig. 2, the text vector model includes a word encoding network 201 and a character encoding network 202, and is described below by taking a first text "I'm on the run it's a state of mind" as an example:
firstly, carrying out word division on a first text 'I'm on the run it's a state of mind' to obtain a first target division text set containing nine target words (I'm, on, the, run, it's, a, state, of, mind); the above character division is performed in parallel on the first text "I'm on the run it's a state of mind" resulting in a second set of target divided text containing 27 target characters (I, ', m, o, n, t, h, e, r, u, n, I, t,', s, a, s, t, a, t, e, o, f, m, I, n, d).
Then, the first target divided text set is input into the word encoding network 201 and encoded into a 64-dimensional word vector; the second target divided text set is input into the character encoding network 202 and encoded into a 64-dimensional character vector.
Finally, the word vector and the character vector are each normalized and then concatenated to obtain a 128-dimensional first text vector.
Note that the word vector and the character vector may instead be 128-dimensional or 256-dimensional vectors, in which case the first text vector is correspondingly a 256-dimensional or 512-dimensional vector; the present disclosure does not limit this.
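The structure of Fig. 2 can be sketched in PyTorch as follows; the LSTM encoders are one of the network types named above, while the vocabulary sizes, use of the final hidden state, and other details are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextVectorModel(nn.Module):
    """Sketch of the two-branch text vector model of Fig. 2."""

    def __init__(self, word_vocab=30000, char_vocab=500, dim=64):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, dim)
        self.char_emb = nn.Embedding(char_vocab, dim)
        self.word_enc = nn.LSTM(dim, dim, batch_first=True)  # word encoding network 201
        self.char_enc = nn.LSTM(dim, dim, batch_first=True)  # character encoding network 202

    def forward(self, word_ids, char_ids):
        _, (wh, _) = self.word_enc(self.word_emb(word_ids))  # final hidden states
        _, (ch, _) = self.char_enc(self.char_emb(char_ids))
        w = F.normalize(wh[-1], dim=-1)   # 64-dim word vector, L2-normalized
        c = F.normalize(ch[-1], dim=-1)   # 64-dim character vector, L2-normalized
        return torch.cat([w, c], dim=-1)  # concatenated 128-dim first text vector

model = TextVectorModel()
word_ids = torch.randint(0, 30000, (1, 9))   # the 9 target words of the example
char_ids = torch.randint(0, 500, (1, 27))    # the 27 target characters of the example
print(model(word_ids, char_ids).shape)       # torch.Size([1, 128])
```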
In this way, the first text is divided in units of words and characters and input into a text vector model comprising a word encoding network and a character encoding network, so a more accurate first text vector can be obtained; in particular, when the words input by a user contain misspellings, the fusion of words and characters yields a first text vector closer to the user's expectation. Comparing the first text vector with the second text vectors then yields the target text vector and its corresponding second text, improving the accuracy of text search and thus the user experience.
Further, the word encoding network may capture word order; that is, the same words arranged in different orders produce different word vectors after encoding. The target words in the first target divided text set may therefore be ordered by their order of appearance in the first text, so that a more accurate word vector is obtained after the word encoding network. Likewise, the character encoding network may capture character order, and the target characters in the second target divided text set may be ordered by their order of appearance in the first text, so that a more accurate character vector is obtained after the character encoding network.
Thus, the text vector model can generate a more accurate first text vector according to the sequence of words or characters, so that the accuracy of text searching is further improved.
Fig. 3 is a flowchart of a training method for the text vector model according to an exemplary embodiment. As shown in Fig. 3, the text vector model is pre-trained through the following steps.
step 301, acquiring a training sample set.
The training sample set comprises a plurality of training sample pairs and similarity of each training sample pair.
In this step, the training sample set may be obtained as follows.
First, the similarity between each historical search result and the historical search information may be determined based on the historical operation behavior the user performed on each of a plurality of historical search results.
The historical search results are obtained by searching according to historical search information input by the user. The historical search information may include texts to be searched entered by the user during a historical period (e.g., the past week or month), and for each such text multiple historical search results can be found. For example, if the text to be searched input by the user is A, n historical search results A1 to An may be found.
The historical operational behavior information may be used to characterize whether a user performed an operation on the historical search results.
For example, the historical operation behavior information may include either user click behavior or user browsing behavior: if the user clicked on a historical search result, or browsed it, the user is deemed to have performed an operation on that result.
Further, the historical operation behavior information may combine click behavior and browsing behavior: for example, the user is considered to have operated on a historical search result only if the user clicked it and then browsed the corresponding page for at least a first preset duration. The first preset duration may be any preset duration between 5 seconds and 2 minutes, for example 20 seconds or 30 seconds.
It should be noted that a user tends to operate on a historical search result when it matches the user's search intention, so the operation behavior performed on historical search results reflects, to some extent, the similarity between each historical search result and the historical search information: results that attract more user operations better match the search intention, i.e., have a relatively higher correlation with the historical search information. In this embodiment, the similarity between each historical search result and the historical search information can therefore be determined from the user's historical operation behavior, without requiring manual similarity labeling.
For example, the similarity between the historical search results and the historical search information may be determined according to the following formula:
S=C/H;
where S represents the similarity between the historical search result and the historical search information, C represents the number of times users performed the historical operation behavior on the historical search result, and H represents the number of times the search engine searched according to the historical search information and displayed the historical search result to users. The similarity obtained by this formula is a value between 0 and 1.
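A direct transcription of this formula as a sketch; the guard against zero displays is an added assumption:

```python
def label_similarity(c: int, h: int) -> float:
    """S = C / H: user operations over the number of times the result was shown."""
    return c / h if h else 0.0  # guard against h == 0 (an added assumption)

print(label_similarity(12, 40))  # 0.3
```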
Then, the historical search information and the historical search result are used as the training sample pair according to the similarity.
For example, historical search information and historical search results having a similarity greater than or equal to a second preset similarity threshold may be selected as the training sample pair. The second preset similarity threshold may be any value between 0.3 and 1, for example, may be 0.4 or 0.7.
Historical search information and historical search results whose similarity is smaller than the second preset similarity threshold are not used as training sample pairs.
Further, since users search frequently, the similarity between a historical search result and the historical search information varies across acquisitions made at different times. A similarity range can therefore be set from the similarities acquired at multiple different times, characterized by the minimum and maximum of those similarities, and this range can be taken as the similarity of the training sample pair.
By way of example, the following 5 similarity ranges may be set: (0.84,1) a minimum value of 0.84 and a maximum value of 1 representing the degree of similarity between the history search result and the history search information; (0.68,0.9) a minimum value of 0.68 and a maximum value of 0.9 representing the degree of similarity between the history search result and the history search information; (0.54,0.8) a minimum value of 0.54 and a maximum value of 0.8 representing the degree of similarity between the history search result and the history search information; (0.48,0.7) a minimum value of 0.48 and a maximum value of 0.7 representing the degree of similarity between the history search result and the history search information; (0.4,0.6) the minimum value representing the similarity between the history search result and the history search information is 0.4 and the maximum value is 0.6.
Finally, a set of a plurality of the training sample pairs is used as the training sample set.
In this way, through the historical search information, the historical search results, and the users' historical operation behavior, enough training sample pairs can be obtained to train the text vector model.
Step 302, determining a first loss function and/or a second loss function according to the training sample pairs in the training sample set.
The first loss function is used to constrain the similarity of each training sample pair to meet a preset correlation similarity requirement. The requirement may be that the similarity equals the similarity labeled for the training sample pair, or that it falls within the similarity range of the pair, i.e., is greater than or equal to the minimum of the range and less than or equal to its maximum. When the preset correlation similarity requirement is the similarity range of the pair, the first loss function can be determined from the training sample pairs and their similarities. For example, the expression of the first loss function may include:
L1 = (1/N) Σᵢ [ max(0, LBᵢ − sim(textᵢ, textᵢ′)) + max(0, sim(textᵢ, textᵢ′) − UBᵢ) ]
where L1 denotes the value of the first loss function, N denotes the total number of training sample pairs, textᵢ and textᵢ′ denote the two texts of the i-th training sample pair, sim(·,·) denotes their similarity, and LBᵢ and UBᵢ denote the minimum and maximum similarity values of that pair.
The second loss function is used to constrain the similarity between the text in each training sample pair and the texts in the other training sample pairs of the training sample set to meet a preset non-correlation similarity requirement; this requirement may be that the similarity approaches 0, or that it is less than or equal to a third preset similarity threshold. The second loss function can thus be determined from the multiple training sample pairs in the training sample set. For example, the expression of the second loss function may include:
L2 = (1/(N(N−1))) Σᵢ Σⱼ≠ᵢ sim(textᵢ, textⱼ′)
where L2 denotes the value of the second loss function, N denotes the total number of training sample pairs, textᵢ denotes a text of the i-th training sample pair, textⱼ′ denotes a text of the j-th training sample pair, and i and j are unequal.
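Since the publication renders both equations only as images, the sketch below encodes the constraints as stated in the surrounding text (pair similarity held inside [LB, UB], cross-pair similarity pushed toward 0); the hinge form and the averaging are assumptions, not the publication's exact formulas:

```python
import torch

def first_loss(pair_sims, lb, ub):
    """Penalize pair similarities that leave the labeled range [LB, UB].

    pair_sims, lb, ub: tensors of shape (N,). The hinge form is an
    assumed reading of the stated constraint.
    """
    return (torch.relu(lb - pair_sims) + torch.relu(pair_sims - ub)).mean()

def second_loss(sim_matrix):
    """Push similarities between texts of different pairs toward 0.

    sim_matrix: (N, N) similarities between text_i and text_j'; the
    diagonal holds the positive pairs and is excluded.
    """
    n = sim_matrix.size(0)
    off_diag = sim_matrix - torch.diag(torch.diag(sim_matrix))
    return off_diag.abs().sum() / (n * (n - 1))
```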
Step 303, training a preset model according to the training sample set, the first loss function and the second loss function to obtain the text vector model.
Likewise, the preset model may be one or more of a convolutional neural network (CNN) or a long short-term memory (LSTM) network.
It should be noted that the first loss function and the second loss function are optional: either one, or both, may be used in training.
In this way, during training, the texts within each training sample pair serve as positive examples via the first loss function, and texts from different training sample pairs in the training sample set serve as negative examples via the second loss function. The trained text vector model therefore generates vectors with higher similarity between similar texts and lower similarity between dissimilar texts, further improving the accuracy of text searching.
In another embodiment of the present disclosure, the second text includes a text sentence and/or a document; the text knowledge base is pre-established by the following method:
First, the text sentences in the second text are divided according to the plurality of preset dividing modes to obtain a plurality of groups of target sentence divided text sets.
Likewise, the preset dividing modes may include word division and character division, and the multiple target sentence divided text sets may include a first target divided text set and a second target divided text set. This step may include: performing word division on the text sentence to obtain a first target divided text set containing one or more target words; and performing character division on the text sentence to obtain a second target divided text set containing one or more target characters, where a target character is any one of a letter, a Chinese character, a number, and a punctuation mark.
Then, the multiple groups of target sentence divided text sets are input into the text vector model to obtain a second text vector corresponding to the text sentence.
Likewise, the text vector model may also include word encoding networks and character encoding networks.
And finally, establishing the text knowledge base according to the second text vector and the text sentence.
For example, a correspondence between each second text vector and its text sentence (i.e., the second text) may be established, so that in the text knowledge base the corresponding second text can be conveniently retrieved from a second text vector.
In this way, a text knowledge base containing the second texts and second text vectors can be constructed, so that the text corresponding to a user's text to be searched can be retrieved from the text knowledge base and displayed to the user.
Further, to facilitate retrieval, a vector index may be built over the second text vectors via the hierarchical navigable small world (HNSW) algorithm, improving the efficiency of text searching.
Further, in the case where the second text includes a text sentence and a document, the text sentence may be a sentence obtained by subjecting the document to sentence segmentation.
The document may be an article, lyrics, or a web page, and sentences may be segmented in various ways, for example according to punctuation and preset special texts. The preset special texts may include preset start texts and preset end texts: a preset start text is a text whose first probability of serving as the starting word of a sentence, obtained from statistics over sample sentences, is greater than a first preset probability threshold; a preset end text is a text whose second probability of serving as the ending word of a sentence, likewise obtained from sample sentence statistics, is greater than a second preset probability threshold.
It should be noted that the ratio of the number of times a word appears as a starting word in the sample sentences to the total number of times the word appears in the sample sentences may be taken as the word's first probability of being a sentence-starting word; likewise, the ratio of the number of times a word appears as an ending word to the total number of times the word appears may be taken as its second probability of being a sentence-ending word, as sketched below.
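A sketch of these statistics; whitespace tokenization of the sample sentences is an illustrative assumption:

```python
from collections import Counter

def boundary_word_probs(sample_sentences):
    """First/second probabilities of words starting or ending a sentence.

    For each word: P_start = (#times it starts a sentence) / (#occurrences),
    P_end = (#times it ends a sentence) / (#occurrences), as described above.
    """
    starts, ends, totals = Counter(), Counter(), Counter()
    for sentence in sample_sentences:
        words = sentence.split()
        if not words:
            continue
        starts[words[0]] += 1
        ends[words[-1]] += 1
        totals.update(words)
    p_start = {w: starts[w] / totals[w] for w in totals}
    p_end = {w: ends[w] / totals[w] for w in totals}
    return p_start, p_end

p_start, p_end = boundary_word_probs(["the cat sat", "the dog ran", "dogs ran fast"])
print(p_start["the"], p_end["ran"])  # 1.0 0.5
```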
Thus, the document is segmented into one or more text sentences; each text sentence is divided according to the plurality of preset dividing modes to obtain multiple groups of target sentence divided text sets, and these sets are input into the text vector model to obtain the second text vector corresponding to the text sentence. Finally, the text knowledge base can be established from the second text vectors, the text sentences, and the document, for example by recording the correspondence of each second text vector to its text sentence and document. In this way, the document or the text sentence can be conveniently obtained through the second text vector, improving the efficiency of text searching.
In addition, because the first text a user inputs when searching is generally a word or a sentence rather than a large block of document text, dividing documents into text sentences better matches the first text to be searched. During a search, only the similarity between the first text vector corresponding to the first text to be searched and the second text vectors corresponding to text sentences needs to be computed, without comparing the first text against entire documents, which further improves the efficiency of text searching.
Further, the method for obtaining the target text vector from the second text vector in the pre-established text knowledge base according to the first text vector may include the following steps:
first, a candidate text vector closest to the first text vector is obtained from a second text vector in the text knowledge base.
For example, neighbor search may be implemented according to the hierarchical navigable small world (HNSW) algorithm, so as to quickly obtain the candidate text vectors closest to the first text vector.
And secondly, acquiring the target text vector from the candidate text vector according to the similarity between the candidate text vector and the first text vector.
In this step, candidate text vectors whose similarity to the first text vector is greater than or equal to a fourth preset similarity threshold may be taken as target text vectors; alternatively, the candidate text vectors may be sorted by their similarity to the first text vector in descending order, and a second preset number of the top-ranked candidate text vectors taken as target text vectors.
In this way, when a large number of second text vectors exist in the text knowledge base, neighbor search via the HNSW algorithm quickly finds the candidate text vectors closest to the first text vector, and the target text vectors are then obtained from the candidates according to similarity. The efficiency of text searching can thus be further improved; a sketch using one common HNSW implementation follows.
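As an illustration, hnswlib is one widely used open-source HNSW implementation; the index parameters and the random stand-in vectors below are assumptions for the sketch, not values from the disclosure:

```python
import hnswlib
import numpy as np

dim, num_texts = 128, 100_000
second_vectors = np.random.rand(num_texts, dim).astype(np.float32)  # stand-in knowledge base

# Build the HNSW vector index over the second text vectors.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_texts, ef_construction=200, M=16)
index.add_items(second_vectors, np.arange(num_texts))
index.set_ef(50)  # query-time accuracy/speed trade-off

# Neighbor search: candidate vectors closest to the first text vector.
first_vector = np.random.rand(1, dim).astype(np.float32)
labels, distances = index.knn_query(first_vector, k=10)
similarities = 1.0 - distances  # cosine distance -> cosine similarity
print(labels[0], similarities[0])
```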
Fig. 4 is a block diagram illustrating a text searching apparatus according to an exemplary embodiment. As shown in fig. 4, the text search device includes:
the first text dividing module 401 is configured to divide a first text to be searched according to a plurality of preset dividing modes, so as to obtain a plurality of groups of target divided text sets;
a first text vector obtaining module 402, configured to input the multiple sets of target divided text sets into a pre-trained text vector model, to obtain a first text vector;
a target text vector obtaining module 403, configured to obtain a target text vector from the second text vectors in a pre-established text knowledge base according to the first text vector, where the text knowledge base includes one or more second text vectors and the second texts corresponding to the second text vectors;
and the target text searching module 404 is configured to take the second text corresponding to the target text vector as a target searching result, and display the target searching result.
Optionally, the preset dividing modes include word division and character division, and the multiple target divided text sets include a first target divided text set and a second target divided text set; the first text dividing module 401 is configured to:
perform the word division on the first text to obtain the first target divided text set containing one or more target words; and
perform the character division on the first text to obtain the second target divided text set containing one or more target characters;
and inputting the plurality of groups of target divided text sets into the pre-trained text vector model to obtain the first text vector includes:
inputting the first target divided text set and the second target divided text set into the pre-trained text vector model to obtain the first text vector.
Optionally, the text vector model includes a word encoding network and a character encoding network; the first text vector obtaining module 402 is configured to obtain a first text vector by:
the word coding network is used for coding the first target divided text set to obtain word vectors;
encoding the second target divided text set through the character encoding network to obtain a character vector;
and calculating the first text vector according to the word vector and the character vector.
FIG. 5 is a block diagram of a training apparatus for a text vector model, according to an exemplary embodiment. As shown in FIG. 5, the training apparatus includes:
a training sample acquiring module 501, configured to acquire a training sample set, where the training sample set includes a plurality of training sample pairs and a similarity of each training sample pair;
a first loss function determining module 502, configured to determine a first loss function according to the training sample pairs and the similarity of the training sample pairs, where the first loss function is configured to restrict the similarity of each training sample pair to meet a preset correlation similarity requirement;
the model training module 503 is configured to train a preset model according to the training sample set and the first loss function to obtain the text vector model.
FIG. 6 is a block diagram of another training apparatus for a text vector model, according to an exemplary embodiment. As shown in FIG. 6, the training apparatus further includes:
a second loss function determining module 601, configured to determine a second loss function according to a plurality of training sample pairs in the training sample set; the second loss function is used for restraining the similarity between the text in each training sample pair and the text in other training sample pairs in the training sample set to meet the preset uncorrelated similarity requirement;
the model training module 503 is configured to train a preset model according to the training sample set, the first loss function, and the second loss function to obtain the text vector model.
Optionally, the training sample acquiring module 501 is configured to:
according to the historical operation behavior information of a user on a plurality of historical search results, respectively determining the similarity between each historical search result and the historical search information, wherein the historical search results are obtained by searching according to the historical search information input by the user;
according to the similarity, the historical search information and the historical search result are used as the training sample pair;
a set of a plurality of the training sample pairs is taken as the training sample set.
Fig. 7 is a block diagram illustrating another text searching apparatus according to an exemplary embodiment. As shown in Fig. 7, the text search device further includes a text knowledge base creation module 701; the second text includes text sentences, and the text knowledge base creation module 701 is configured to create the text knowledge base in advance by:
dividing the text sentence according to the plurality of preset dividing modes to obtain a plurality of groups of target sentence divided text sets;
inputting the multiple groups of target sentence divided text sets into the text vector model to obtain a second text vector corresponding to the text sentence;
and establishing the text knowledge base according to the second text vector and the text sentence.
Optionally, the second text further includes a document, and the text sentence is a sentence obtained by performing sentence segmentation on the document; the text knowledge base building module 701 is further configured to build the text knowledge base according to the second text vector, the text sentence, and the document.
Optionally, the target text vector obtaining module 403 is configured to obtain, from a second text vector in the text knowledge base, a candidate text vector closest to the first text vector; and acquiring the target text vector from the candidate text vector according to the similarity of the candidate text vector and the first text vector.
In summary, the first text to be searched is divided according to a plurality of preset dividing modes to obtain multiple groups of target divided text sets, which effectively tolerate a user's misspelling and reflect the user's real expectation; the multiple groups of target divided text sets are then input into a pre-trained text vector model to obtain a first text vector; and, by vector search according to the first text vector, a target text vector and the second text corresponding to it are obtained from the pre-established text knowledge base as the target search result. In this way, the accuracy of text searching can be improved, incorrect or incomplete search results caused by a user's misspelling can be avoided, and the user experience is improved.
Referring now to fig. 8, a schematic diagram of an electronic device 800 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 800 may include a processing device (e.g., a central processor, a graphics processor, etc.) 801, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage device 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data required for the operation of the electronic device 800. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, etc.; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The above computer-readable medium may be contained in the electronic device, or may exist separately without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: divide a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; input the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector; obtain a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors; and take the second text corresponding to the target text vector as a target search result and display the target search result.
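As a non-limiting illustration of this flow, the following Python sketch mirrors the four steps at query time. It is a sketch under stated assumptions rather than the disclosed implementation: the encode callable stands in for the pre-trained text vector model, the knowledge base is held as a plain numpy matrix, and cosine similarity is one possible choice of vector comparison.

import numpy as np

def search(query, encode, kb_vectors, kb_texts):
    # Division mode 1: word division (a whitespace split stands in for a
    # real word segmenter, e.g. one suited to Chinese text).
    word_tokens = query.split()
    # Division mode 2: character division.
    char_tokens = [c for c in query if not c.isspace()]
    # Both target divided text sets feed the same text vector model,
    # which yields the first text vector.
    q_vec = encode(word_tokens, char_tokens)
    # Compare against every second text vector in the knowledge base.
    sims = kb_vectors @ q_vec / (
        np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    # The second text whose vector is most similar is the target search result.
    return kb_texts[int(np.argmax(sims))]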
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not constitute a limitation of the module itself; for example, the first text dividing module may also be described as "a module for dividing the first text to be searched according to multiple preset dividing modes to obtain multiple groups of target divided text sets".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, example 1 provides a text search method, including: dividing a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; inputting the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector; obtaining a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors; and taking the second text corresponding to the target text vector as a target search result and displaying the target search result.
In accordance with one or more embodiments of the present disclosure, example 2 provides the method of example 1, wherein the preset dividing modes include word division and character division, and the plurality of target divided text sets include a first target divided text set and a second target divided text set; dividing the first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets comprises: performing the word division on the first text to obtain the first target divided text set containing one or more target words; and performing the character division on the first text to obtain the second target divided text set containing one or more target characters; and inputting the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector comprises: inputting the first target divided text set and the second target divided text set into the pre-trained text vector model to obtain the first text vector.
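A hedged illustration of the two division modes follows; the whitespace split is a stand-in, since for Chinese text the word division would typically rely on a dedicated word segmenter, which example 2 does not specify.

def divide(first_text):
    # Word division: produces the first target divided text set.
    first_set = first_text.split()
    # Character division: produces the second target divided text set.
    second_set = [c for c in first_text if not c.isspace()]
    return first_set, second_set

words, chars = divide("text search method")
# words -> ['text', 'search', 'method']
# chars -> ['t', 'e', 'x', 't', 's', 'e', 'a', 'r', 'c', 'h', ...]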
Example 3 provides the method of example 2, according to one or more embodiments of the present disclosure, wherein the text vector model comprises a word encoding network and a character encoding network, and the text vector model is used for: encoding the first target divided text set through the word encoding network to obtain a word vector; encoding the second target divided text set through the character encoding network to obtain a character vector; and calculating the first text vector according to the word vector and the character vector.
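One possible shape for such a dual-network model is sketched below with PyTorch; the embedding-bag layers, the vector dimension, and the linear combination step are illustrative assumptions, since the disclosure does not fix the internal architecture of either encoding network.

import torch
import torch.nn as nn

class TextVectorModel(nn.Module):
    def __init__(self, word_vocab_size, char_vocab_size, dim=128):
        super().__init__()
        # Word encoding network: embeds and mean-pools the target words.
        self.word_encoder = nn.EmbeddingBag(word_vocab_size, dim, mode="mean")
        # Character encoding network: embeds and mean-pools the target characters.
        self.char_encoder = nn.EmbeddingBag(char_vocab_size, dim, mode="mean")
        # Combination step: the first text vector is calculated from both.
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, word_ids, char_ids):
        w = self.word_encoder(word_ids)              # word vector, shape (B, dim)
        c = self.char_encoder(char_ids)              # character vector, shape (B, dim)
        return self.proj(torch.cat([w, c], dim=-1))  # first text vector

# Usage: TextVectorModel(30000, 5000)(torch.tensor([[1, 2, 3]]), torch.tensor([[4, 5, 6, 7]]))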
In accordance with one or more embodiments of the present disclosure, example 4 provides the method of example 1, wherein the text vector model is pre-trained by: acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs and the similarity of each training sample pair; determining a first loss function according to the training sample pairs and the similarity of the training sample pairs, wherein the first loss function is used for constraining the similarity of each training sample pair to meet a preset correlation similarity requirement; and training a preset model according to the training sample set and the first loss function to obtain the text vector model.
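Read as code, one plausible form of the first loss is a regression of each pair's predicted similarity onto its labeled similarity; the cosine measure and the mean squared error below are assumptions, as example 4 only requires that the loss constrain pair similarity toward the preset requirement.

import torch
import torch.nn.functional as F

def first_loss(vec_a, vec_b, target_sim):
    # Predicted similarity of each training sample pair.
    pred_sim = F.cosine_similarity(vec_a, vec_b, dim=-1)
    # Pull the predicted similarity toward the labeled (preset) similarity.
    return F.mse_loss(pred_sim, target_sim)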
According to one or more embodiments of the present disclosure, example 5 provides the method of example 4, further including, before training the preset model according to the training sample set and the first loss function to obtain the text vector model: determining a second loss function according to a plurality of training sample pairs in the training sample set, wherein the second loss function is used for constraining the similarity between the text in each training sample pair and the text in other training sample pairs in the training sample set to meet a preset uncorrelated similarity requirement; and training the preset model according to the training sample set and the first loss function to obtain the text vector model comprises: training the preset model according to the training sample set, the first loss function, and the second loss function to obtain the text vector model.
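A sketch of the second loss follows, under the assumption that texts from other training sample pairs in the same batch serve as negatives whose mutual similarity should stay below a margin; the margin value is an illustrative choice.

import torch
import torch.nn.functional as F

def second_loss(vecs_a, vecs_b, margin=0.2):
    a = F.normalize(vecs_a, dim=-1)
    b = F.normalize(vecs_b, dim=-1)
    sims = a @ b.t()  # batch-by-batch similarity matrix
    # Zero the diagonal so only cross-pair (uncorrelated) similarities remain.
    off_diag = sims - torch.diag_embed(torch.diagonal(sims))
    # Penalize any uncorrelated similarity that exceeds the margin.
    return F.relu(off_diag - margin).mean()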
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 4, wherein acquiring the training sample set comprises: determining, according to historical operation behavior information of a user on a plurality of historical search results, the similarity between each historical search result and the historical search information, wherein the historical search results are obtained by searching according to the historical search information input by the user; taking the historical search information and the historical search result as a training sample pair according to the similarity; and taking a set of a plurality of training sample pairs as the training sample set.
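The conversion from behavior logs to labeled pairs might look like the sketch below; the specific weighting of clicks versus browses is an assumption, since example 6 only states that similarity is derived from historical operation behavior.

def build_training_set(logs):
    samples = []
    for row in logs:
        # row example: {"query": "...", "result": "...", "clicks": 3, "browses": 10}
        # Heuristic weighting (assumed): clicks count more than browses, capped at 1.0.
        sim = min(1.0, 0.3 * row["clicks"] + 0.1 * row["browses"])
        samples.append((row["query"], row["result"], sim))
    return samples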
Example 7 provides the method of example 1, the second text comprising a text sentence, in accordance with one or more embodiments of the present disclosure; the text knowledge base is pre-established by: dividing the text sentence according to the plurality of preset dividing modes to obtain a plurality of groups of target sentence dividing text sets; inputting the multiple groups of target sentence dividing text sets into the text vector model to obtain a second text vector corresponding to the text sentence; and establishing the text knowledge base according to the second text vector and the text sentence.
Example 8 provides the method of example 7, wherein the second text further comprises a document, and the text sentence is a sentence obtained by sentence segmentation of the document, according to one or more embodiments of the present disclosure; establishing the text knowledge base from the second text vector and the text sentence comprises: and establishing the text knowledge base according to the second text vector, the text sentence and the document.
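Offline construction of such a knowledge base could proceed as sketched below; the regex sentence splitter, the in-memory lists, and the encode helper reused from the query-time sketch are simplifying assumptions.

import re
import numpy as np

def build_knowledge_base(documents, encode):
    vectors, sentences, sources = [], [], []
    for doc in documents:
        # Sentence segmentation of the document (a regex is a simplification).
        for sent in re.split(r"(?<=[.!?。！？])\s*", doc):
            if not sent.strip():
                continue
            # Same two division modes as at query time.
            vec = encode(sent.split(), [c for c in sent if not c.isspace()])
            vectors.append(vec)      # second text vector
            sentences.append(sent)   # text sentence (second text)
            sources.append(doc)      # originating document
    return np.vstack(vectors), sentences, sources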
Example 9 provides the method of example 8, according to one or more embodiments of the present disclosure, wherein obtaining the target text vector from the second text vectors of the pre-established text knowledge base according to the first text vector comprises: obtaining candidate text vectors closest to the first text vector from the second text vectors in the text knowledge base; and acquiring the target text vector from the candidate text vectors according to the similarity between the candidate text vectors and the first text vector.
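The two-stage lookup of example 9 might be realized as follows: a candidate pass by vector distance, then a selection pass by similarity. The brute-force distance scan (an approximate nearest neighbor index such as faiss would be the usual substitute at scale) and the similarity floor are assumptions.

import numpy as np

def retrieve(q_vec, kb_vectors, top_k=10, min_sim=0.5):
    # Stage 1: candidate text vectors closest to the first text vector.
    dists = np.linalg.norm(kb_vectors - q_vec, axis=1)
    candidates = np.argsort(dists)[:top_k]
    # Stage 2: pick the target text vector by similarity, with a floor.
    cand = kb_vectors[candidates]
    sims = cand @ q_vec / (np.linalg.norm(cand, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    best = int(np.argmax(sims))
    return int(candidates[best]) if sims[best] >= min_sim else None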
According to one or more embodiments of the present disclosure, example 10 provides a text search apparatus, comprising: a first text dividing module 401, configured to divide a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets; a first text vector obtaining module 402, configured to input the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector; a target text vector obtaining module 403, configured to obtain a target text vector from second text vectors in a pre-established text knowledge base according to the first text vector, where the text knowledge base includes one or more second text vectors and second texts corresponding to the second text vectors; and a target text searching module 404, configured to take the second text corresponding to the target text vector as a target search result and display the target search result.
According to one or more embodiments of the present disclosure, example 11 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the methods described in examples 1 to 9.
Example 12 provides an electronic device according to one or more embodiments of the present disclosure, comprising: a storage device having a computer program stored thereon; and a processing device for executing the computer program in the storage device to realize the steps of the methods described in examples 1 to 9.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, a solution formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be elaborated here.

Claims (10)

1. A text search method, the method comprising:
dividing a first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets;
inputting the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector;
obtaining a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
taking a second text corresponding to the target text vector as a target search result, and displaying the target search result;
wherein the text vector model is pre-trained by:
acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs and the similarity of each training sample pair;
determining a first loss function according to the training sample pairs and the similarity of the training sample pairs, wherein the first loss function is used for constraining the similarity of each training sample pair to meet a preset correlation similarity requirement, the preset correlation similarity requirement being a similarity range corresponding to the training sample pair;
training a preset model according to the training sample set and the first loss function to obtain the text vector model;
wherein acquiring the training sample set comprises:
determining, according to historical operation behavior information of the user on a plurality of historical search results, the similarity between each historical search result and the historical search information, wherein the historical search results are obtained by searching according to the historical search information input by the user, the historical search information comprises a text to be searched input by the user in a historical period, and the historical operation behavior information comprises user click behavior information and user browsing behavior information for the historical search results;
taking the historical search information and the historical search result as the training sample pair according to the similarity;
setting a similarity range between the historical search result and the historical search information according to similarities between the historical search result and the historical search information acquired multiple times at different moments, wherein the similarity range is used for representing the maximum value and the minimum value of the similarity between the historical search result and the historical search information;
and taking a set of a plurality of training sample pairs as the training sample set.
2. The method of claim 1, wherein the preset dividing modes include word division and character division, and the plurality of target divided text sets include a first target divided text set and a second target divided text set; dividing the first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets comprises:
performing the word division on the first text to obtain the first target divided text set containing one or more target words;
performing the character division on the first text to obtain the second target divided text set containing one or more target characters;
inputting the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector comprises:
and inputting the first target divided text set and the second target divided text set into a pre-trained text vector model to obtain the first text vector.
3. The method of claim 2, wherein the text vector model includes a word encoding network and a character encoding network; the text vector model is used for:
encoding the first target divided text set through the word encoding network to obtain a word vector;
encoding the second target divided text set through the character encoding network to obtain a character vector;
and calculating the first text vector according to the word vector and the character vector.
4. The method of claim 1, wherein, prior to training a preset model according to the training sample set and the first loss function to obtain the text vector model, the method further comprises:
determining a second loss function according to a plurality of training sample pairs in the training sample set, wherein the second loss function is used for constraining the similarity between the text in each training sample pair and the text in other training sample pairs in the training sample set to meet a preset uncorrelated similarity requirement;
training a preset model according to the training sample set and the first loss function to obtain the text vector model comprises:
and training a preset model according to the training sample set, the first loss function and the second loss function to obtain the text vector model.
5. The method of claim 1, wherein the second text comprises a text sentence; the text knowledge base is pre-established by:
dividing the text sentence according to the plurality of preset dividing modes to obtain a plurality of groups of target sentence dividing text sets;
inputting the multiple groups of target sentence dividing text sets into the text vector model to obtain a second text vector corresponding to the text sentence;
and establishing the text knowledge base according to the second text vector and the text sentence.
6. The method according to claim 5, wherein the second text further includes a document, and the text sentence is a sentence obtained by subjecting the document to sentence segmentation; establishing the text knowledge base from the second text vector and the text sentence comprises:
and establishing the text knowledge base according to the second text vector, the text sentence and the document.
7. The method of claim 1, wherein obtaining the target text vector from the second text vectors of the pre-established text knowledge base according to the first text vector comprises:
obtaining candidate text vectors closest to the first text vector from the second text vectors in the text knowledge base;
and acquiring the target text vector from the candidate text vectors according to the similarity between the candidate text vectors and the first text vector.
8. A text search device, the device comprising:
the first text dividing module is used for dividing the first text to be searched according to a plurality of preset dividing modes to obtain a plurality of groups of target divided text sets;
the first text vector acquisition module is used for inputting the multiple groups of target divided text sets into a pre-trained text vector model to obtain a first text vector;
the target text vector acquisition module is used for acquiring a target text vector from second text vectors of a pre-established text knowledge base according to the first text vector, wherein the text knowledge base comprises one or more second text vectors and second texts corresponding to the second text vectors;
the target text searching module is used for taking the second text corresponding to the target text vector as a target search result and displaying the target search result;
the apparatus further comprises:
the training sample acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a plurality of training sample pairs and the similarity of each training sample pair;
the first loss function determining module is used for determining a first loss function according to the training sample pairs and the similarity of the training sample pairs, wherein the first loss function is used for constraining the similarity of each training sample pair to meet a preset correlation similarity requirement; the preset correlation similarity requirement is a similarity range corresponding to the training sample pair;
the model training module is used for training a preset model according to the training sample set and the first loss function to obtain the text vector model;
the training sample acquisition module is used for: determining, according to historical operation behavior information of the user on a plurality of historical search results, the similarity between each historical search result and the historical search information, wherein the historical search results are obtained by searching according to the historical search information input by the user, the historical search information comprises a text to be searched input by the user in a historical period, and the historical operation behavior information comprises user click behavior information and user browsing behavior information for the historical search results; taking the historical search information and the historical search result as the training sample pair according to the similarity; setting a similarity range between the historical search result and the historical search information according to similarities between the historical search result and the historical search information acquired multiple times at different moments, wherein the similarity range is used for representing the maximum value and the minimum value of the similarity between the historical search result and the historical search information; and taking a set of a plurality of training sample pairs as the training sample set.
9. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processing device, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of any one of claims 1 to 7.
CN202110726639.8A 2021-06-29 2021-06-29 Text searching method and device, readable medium and electronic equipment Active CN113407814B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110726639.8A CN113407814B (en) 2021-06-29 2021-06-29 Text searching method and device, readable medium and electronic equipment
PCT/CN2022/090994 WO2023273598A1 (en) 2021-06-29 2022-05-05 Text search method and apparatus, and readable medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110726639.8A CN113407814B (en) 2021-06-29 2021-06-29 Text searching method and device, readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113407814A CN113407814A (en) 2021-09-17
CN113407814B true CN113407814B (en) 2023-06-16

Family

ID=77680089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110726639.8A Active CN113407814B (en) 2021-06-29 2021-06-29 Text searching method and device, readable medium and electronic equipment

Country Status (2)

Country Link
CN (1) CN113407814B (en)
WO (1) WO2023273598A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407814B (en) * 2021-06-29 2023-06-16 抖音视界有限公司 Text searching method and device, readable medium and electronic equipment
CN117251557B (en) * 2023-11-20 2024-02-27 中信证券股份有限公司 Legal consultation sentence reply method, device, equipment and computer readable medium
CN117349400B (en) * 2023-12-04 2024-02-27 环球数科集团有限公司 Prompt word construction method based on AIGC

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073574A (en) * 2016-11-16 2018-05-25 三星电子株式会社 For handling the method and apparatus of natural language and training natural language model
WO2019127924A1 (en) * 2017-12-29 2019-07-04 深圳云天励飞技术有限公司 Sample weight allocation method, model training method, electronic device, and storage medium
CN112632403A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Recommendation model training method, recommendation device, recommendation equipment and recommendation medium
CN112700766A (en) * 2020-12-23 2021-04-23 北京猿力未来科技有限公司 Training method and device of voice recognition model and voice recognition method and device
CN112860848A (en) * 2021-01-20 2021-05-28 平安科技(深圳)有限公司 Information retrieval method, device, equipment and medium
CN113033622A (en) * 2021-03-05 2021-06-25 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for cross-modal retrieval model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078858B (en) * 2018-10-19 2023-06-09 阿里巴巴集团控股有限公司 Article searching method and device and electronic equipment
US11321312B2 (en) * 2019-01-14 2022-05-03 ALEX—Alternative Experts, LLC Vector-based contextual text searching
CN109981787B (en) * 2019-04-03 2022-03-29 北京字节跳动网络技术有限公司 Method and device for displaying information
CN111753551B (en) * 2020-06-29 2022-06-14 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model
CN112100529B (en) * 2020-11-17 2021-03-19 北京三快在线科技有限公司 Search content ordering method and device, storage medium and electronic equipment
CN112434173B (en) * 2021-01-26 2021-04-20 浙江口碑网络技术有限公司 Search content output method and device, computer equipment and readable storage medium
CN113407814B (en) * 2021-06-29 2023-06-16 抖音视界有限公司 Text searching method and device, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN113407814A (en) 2021-09-17
WO2023273598A1 (en) 2023-01-05

Similar Documents

Publication Publication Date Title
CN110647614B (en) Intelligent question-answering method, device, medium and electronic equipment
CN113407814B (en) Text searching method and device, readable medium and electronic equipment
CN109947919B (en) Method and apparatus for generating text matching model
CN110969012A (en) Text error correction method and device, storage medium and electronic equipment
CN114861889B (en) Deep learning model training method, target object detection method and device
EP3916579A1 (en) Method for resource sorting, method for training sorting model and corresponding apparatuses
US9684726B2 (en) Realtime ingestion via multi-corpus knowledge base with weighting
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
WO2023279843A1 (en) Content search method, apparatus and device, and storage medium
US20180189307A1 (en) Topic based intelligent electronic file searching
US20200159765A1 (en) Performing image search using content labels
CN113011172B (en) Text processing method, device, computer equipment and storage medium
CN111538903B (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN111538830B (en) French searching method, device, computer equipment and storage medium
CN112052297A (en) Information generation method and device, electronic equipment and computer readable medium
CN112182255A (en) Method and apparatus for storing media files and for retrieving media files
CN111078849A (en) Method and apparatus for outputting information
CN112632285A (en) Text clustering method and device, electronic equipment and storage medium
JP2022541832A (en) Method and apparatus for retrieving images
CN109472028B (en) Method and device for generating information
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN115858732A (en) Entity linking method and device
CN113221572B (en) Information processing method, device, equipment and medium
CN115640523A (en) Text similarity measurement method, device, equipment, storage medium and program product
CN111737571B (en) Searching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Tiktok vision (Beijing) Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: BEIJING BYTEDANCE NETWORK TECHNOLOGY Co.,Ltd.

Address after: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant after: Douyin Vision Co.,Ltd.

Address before: 100041 B-0035, 2 floor, 3 building, 30 Shixing street, Shijingshan District, Beijing.

Applicant before: Tiktok vision (Beijing) Co.,Ltd.

GR01 Patent grant