CN117149990A - Text retrieval method, text retrieval device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117149990A
CN117149990A (application CN202210541779.2A)
Authority
CN
China
Prior art keywords
vector
text
inverted
index
recall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210541779.2A
Other languages
Chinese (zh)
Inventor
林伟家
刘子甲
王志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin 3600 Kuaikan Technology Co ltd
Original Assignee
Tianjin 3600 Kuaikan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin 3600 Kuaikan Technology Co ltd filed Critical Tianjin 3600 Kuaikan Technology Co ltd
Priority to CN202210541779.2A priority Critical patent/CN117149990A/en
Publication of CN117149990A publication Critical patent/CN117149990A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a text retrieval method, a text retrieval device, electronic equipment and a storage medium. The method comprises: acquiring a user search request; encoding the query text input by the user through a deep learning model to obtain a second vector; querying, from a first vector index, a third vector with the highest similarity to the second vector, wherein the first vector index is obtained by splitting the document library to be retrieved into a plurality of independent sub-texts and encoding the split sub-texts with the deep learning model; and taking the sub-text corresponding to the third vector as the target text. The method performs text retrieval based on semantic vector retrieval, improving both retrieval performance and the accuracy of retrieval results.

Description

Text retrieval method, text retrieval device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of data mining, and particularly relates to a text retrieval method, a text retrieval device, electronic equipment and a storage medium.
Background
Currently, network operations based on text mining 100 are widely applied in scenarios such as risk management 101, knowledge management 102, cybercrime prevention management 103, customer service 104, insurance claims 105, contextual advertisement recommendation 106, business intelligence 107, mail filtering 108 and social media analysis 109, as shown in fig. 1. User behavior is mainly understood by combining text analysis technology with traditional statistical analysis, so that products and services can be provided more accurately on websites; meanwhile, text analysis technology is used to process text information, and the processed text content is pushed directly to the user as the output of the online service.
Traditional text retrieval methods perform semantic analysis on the words or phrases obtained by segmenting the text, using the meaning of individual words or phrases within a sentence to form semantic information at a single level of granularity, and then retrieve over that single-granularity information. Semantic information at other granularities is lost, correlation between pieces of semantic information is not considered, the recall capability for semantic-level relevance is weak, and text retrieval accuracy is poor.
Disclosure of Invention
The embodiments of the application aim to provide a text retrieval method, a text retrieval device, electronic equipment and a storage medium, which perform text retrieval based on semantic vector retrieval, thereby improving text retrieval performance and the accuracy of retrieval results.
In order to solve the technical problems, the application is realized as follows:
in a first aspect, an embodiment of the present application provides a text retrieval method, including:
acquiring a user search request;
encoding the query text input by the user through a deep learning model to obtain a second vector;
querying, from a first vector index, a third vector with the highest similarity to the second vector,
wherein the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding the split sub-texts with the deep learning model;
and taking the sub-text corresponding to the third vector as a target text.
Optionally, the method further comprises:
generating an inverted file of the document library to be retrieved;
and generating a first inverted index of the document library to be retrieved according to the inverted file.
Optionally, the querying, from the first vector index, a third vector having the highest similarity with the second vector includes:
splitting the query text into a plurality of independent segmentations;
querying inverted chain data corresponding to each word in the first inverted index;
finding at least one center point which is at a preset distance from the second vector in the first vector index, and acquiring inverted chain data corresponding to each center point;
intersection is calculated on inverted chain data of each word segment, and a first weight value is obtained;
obtaining a union set of inverted chain data corresponding to each center point to obtain a second weight value;
comparing the first weight value with the second weight value, filtering out the inverted chain data with the larger weight value, and storing the result in a recall intermediate result data set when a preset filtering condition is met;
and sequencing the recall intermediate result data set, and determining the third vector.
Optionally, collection of the recall intermediate result data set is terminated when the intermediate result data set reaches a preset first storage capacity threshold, or when the time spent retrieving with the query text exceeds a preset first time threshold.
Optionally, the sorting the recall intermediate result data set, determining the third vector, includes:
and sequencing all the inverted chain data stored in the recall intermediate result data set according to the score, and intercepting the inverted chain data ranked at the top as the third vector.
Optionally, the querying inverted chain data corresponding to each word segment in the first inverted index includes:
acquiring document number ID information of each word;
and arranging the document number ID information of all the segmented words in a descending order to form the inverted chain data.
Optionally, the method further comprises:
determining a weight value of each word according to the document number ID information of each word;
determining recall time of each word segmentation according to the size of each word segmentation weight value;
and cutting off the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
Optionally, the method further comprises:
writing the first vector index and the first inverted index into a memory segment together to construct a temporary memory vector index;
when the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold,
and writing the first vector index and the first inverted index into a disk segment together to construct a persistent disk vector index.
In a second aspect, an embodiment of the present application provides a text retrieval apparatus, including:
the acquisition module is used for acquiring a user search request;
the coding module is used for coding the query text input by the user through the deep learning model to obtain a second vector;
a retrieval module for querying a third vector with highest similarity with the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then coding the split sub-texts by using a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
Optionally, the device further includes:
generating an inverted file of the document library to be retrieved;
And generating a first inverted index of the document library to be retrieved according to the inverted file.
Optionally, the querying, from the first vector index, a third vector having the highest similarity with the second vector includes:
splitting the query text into a plurality of independent segmentations;
querying inverted chain data corresponding to each word in the first inverted index;
finding at least one center point which is at a preset distance from the second vector in the first vector index, and acquiring inverted chain data corresponding to each center point;
intersection is calculated on inverted chain data of each word segment, and a first weight value is obtained;
obtaining a union set of inverted chain data corresponding to each center point to obtain a second weight value;
comparing the first weight value with the second weight value, filtering out the inverted chain data with the larger weight value, and storing the result in a recall intermediate result data set when a preset filtering condition is met;
and sequencing the recall intermediate result data set, and determining the third vector.
Optionally, collection of the recall intermediate result data set is terminated when the intermediate result data set reaches a preset first storage capacity threshold, or when the time spent retrieving with the query text exceeds a preset first time threshold.
Optionally, the sorting the recall intermediate result data set, determining the third vector, includes:
and sequencing all the inverted chain data stored in the recall intermediate result data set according to the score, and intercepting the inverted chain data ranked at the top as the third vector.
Optionally, the querying inverted chain data corresponding to each word segment in the first inverted index includes:
acquiring document number ID information of each word;
and arranging the document number ID information of all the segmented words in a descending order to form the inverted chain data.
Optionally, the device further includes:
determining a weight value of each word according to the document number ID information of each word;
determining recall time of each word segmentation according to the size of each word segmentation weight value;
and cutting off the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
Optionally, the device further includes a storage module, configured to:
writing the first vector index and the first inverted index into a memory segment together to construct a temporary memory vector index;
When the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold,
and writing the first vector index and the first inverted index into a disk segment together to construct a persistent disk vector index.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a program or instructions stored on the memory and executable on the processor, which when executed by the processor, implement the steps of the text retrieval method described above.
In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the text retrieval method described above.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the steps of the text retrieval method described above.
In the embodiment of the application, semantic vector retrieval is combined with text retrieval, text retrieval is carried out based on vector retrieval, after a user search request is acquired, a query text input by a user is encoded through a deep learning model to obtain a second vector, then a first vector index obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding through the deep learning model is utilized, a third vector with highest similarity with the second vector is queried from the first vector index, and the sub-text corresponding to the third vector is taken as a target text. According to the text retrieval method, from the consideration of correlation between semantic information, the semantic recall effect of semantic vector retrieval under a scene combined with text retrieval is greatly improved, and the text retrieval performance and the accuracy of retrieval results are improved.
Drawings
FIG. 1 is a schematic diagram of a typical application scenario for text retrieval;
FIG. 2 is a block diagram of an application system of a text retrieval method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a text retrieval method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a text retrieval method step S3 according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a text retrieval device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 7 is a schematic diagram of a specific hardware structure of an electronic device according to an embodiment of the present application;
wherein:
100-text mining;
101-risk management;
102-knowledge management;
103-cyber crime prevention management;
104-customer service;
105-insurance claim;
106-contextual advertisement recommendation;
107-business intelligence;
108-mail filtering;
109-social media analysis;
201-a user side;
202-a search engine server;
400-text retrieval means;
401-an acquisition module;
402-an encoding module;
403-a retrieval module;
404-a memory module;
500-an electronic device;
501-a processor;
502-memory;
600-an electronic device;
601-a radio frequency unit;
602-a network module;
603-an audio output unit;
604-an input unit;
6041-graphics processor;
6042 microphone;
605-a sensor;
606-a display unit;
6061-display panel;
607-user input unit;
6071-touch panel;
6072-other input device;
608-an interface unit;
609-memory;
610-a processor.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the application may be practiced otherwise than as specifically illustrated or described herein. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
As described above, the existing conventional text retrieval lacks consideration of correlation between semantic information, and has the problems of weak recall capability of correlation at semantic level and poor text retrieval accuracy. Therefore, the embodiment of the application provides a text retrieval method scheme based on vector index.
The text retrieval method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
FIG. 2 is a block diagram of an application system of a text retrieval method according to an embodiment of the present application;
the text retrieval system mainly comprises a user terminal 201 and a search engine server terminal 202, wherein the user terminal 201 is in communication interaction with the search engine server terminal 202, the user terminal sends an online search request to the search engine server terminal 202, the search engine server terminal 202 stores a document database required for retrieval, and after receiving the online search request of a user, related document inquiry is carried out according to the search request.
Fig. 3 is a flowchart of a text retrieval method according to an embodiment of the present application.
Referring to fig. 3, a text retrieval method provided in an embodiment of the present application includes:
s1, acquiring a user search request;
s2, coding the query text input by the user through a deep learning model to obtain second vectors.
In a specific implementation, after a user inputs a query text into a search engine, the query result, namely the target text, is obtained from the document library to be retrieved. The document library to be retrieved is the search engine's database; taking academic paper search as an example, the document library to be retrieved may be a knowledge network database.
After the user search request is acquired, the query text input by the user is encoded through the deep learning model to obtain a second vector, namely the second vector is the vector of the query text input by the user. The user search request may be in the form of an online search request by the user or in the form of a non-online search request by the user.
In this embodiment, text retrieval is combined with vector indexing: a first vector index of the document library to be retrieved is constructed, and retrieval is performed against it. Specifically, the document library to be retrieved is split into a plurality of independent sub-texts, and the split sub-texts are encoded with a first deep learning model to obtain the first vector index of the document library.
The query text can be encoded with the same deep learning model used for the document library: the query text input by the user is encoded with the first deep learning model to obtain the second vector.
In addition, while the document library to be retrieved is split into sub-texts and encoded into the first vector index, an inverted file of the document library is generated, and a first inverted index of the document library is generated from the inverted file.
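As a rough illustration only (not the patent's actual implementation), the index-building step described above can be sketched in Python. The `encode` function below is a hashing stand-in for the deep learning model, and the sentence-level splitting rule is an assumption:

```python
import re
from collections import defaultdict

def split_into_subtexts(doc):
    """Split a document into independent sub-texts (here: sentences)."""
    return [s.strip() for s in re.split(r"[.!?]", doc) if s.strip()]

def encode(text, dim=8):
    """Stand-in encoder: a real system would use a deep learning model."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def build_indexes(docs):
    vector_index = {}                   # sub-text id -> embedding (the "first vector index")
    inverted_index = defaultdict(set)   # term -> sub-text ids (the "first inverted index")
    subtexts = {}
    sid = 0
    for doc in docs:
        for sub in split_into_subtexts(doc):
            subtexts[sid] = sub
            vector_index[sid] = encode(sub)
            for term in sub.lower().split():
                inverted_index[term].add(sid)
            sid += 1
    return subtexts, vector_index, dict(inverted_index)
```

Both structures are built in a single pass over the sub-texts, mirroring the description above where the vector index and the inverted file are generated from the same split.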
S3, inquiring a third vector with highest similarity with the second vector from the first vector index.
In a specific implementation, as shown in fig. 4, which gives the specific flow of step S3, querying the third vector with the highest similarity to the second vector from the first vector index comprises the following steps:
s31, splitting the query text into a plurality of independent segmented words, and querying inverted chain data corresponding to each segmented word in the first inverted index.
In a specific implementation, the querying, in the first inverted index, inverted link data corresponding to each word segment includes: acquiring document number ID information of each word; and arranging the document number ID information of all the segmented words in a descending order to form the inverted chain data.
For example, suppose a user searches a knowledge network database for papers related to a text mining engine. The query text input by the user is first split into several independent words such as "text", "mining" and "engine"; the document number (ID) information of "text", "mining" and "engine" is then queried in the first inverted index of the document library; finally, the queried document IDs of all the words are arranged in descending order to form the inverted chain data corresponding to each word.
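A minimal sketch of this step, with a hypothetical three-document corpus, might look as follows; note the descending sort of document IDs that forms each posting (inverted chain) list:

```python
from collections import defaultdict

def build_posting_lists(docs):
    """docs: {doc_id: text}. Returns term -> doc IDs in descending order."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {t: sorted(ids, reverse=True) for t, ids in postings.items()}

docs = {
    1: "text mining engine overview",
    2: "a survey of text mining",
    3: "search engine design",
}
plists = build_posting_lists(docs)
```

For the query words "text", "mining" and "engine", `plists` holds the inverted chain data of each word, e.g. `plists["text"]` is `[2, 1]`.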
S32, finding at least one center point which is away from the second vector and meets the preset distance in the first vector index, and acquiring inverted chain data corresponding to each center point.
In a specific implementation, text retrieval is combined with the vector index: the first vector index of the document library to be retrieved is constructed, and retrieval is performed against it. After the user's online search request is received, the query text input by the user is encoded by the deep learning model to obtain the second vector, i.e. the vector of the user's query text. Then, at least one center point whose distance from the second vector satisfies the preset distance is found in the first vector index, and the inverted chain data corresponding to each center point is obtained.
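This center-point lookup can be sketched as follows. The centroids, their posting lists, and the distance threshold are all hypothetical values for illustration; the patent does not specify the distance metric, so Euclidean distance is assumed:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centers_within(query_vec, centroids, max_dist):
    """Return IDs of centroids whose distance to the query satisfies max_dist."""
    return [cid for cid, c in centroids.items() if euclidean(query_vec, c) <= max_dist]

# Hypothetical coarse centroids and their posting lists (centroid -> sub-text IDs)
centroids = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [5.0, 5.0]}
center_postings = {0: [3, 1], 1: [4, 2], 2: [9, 8]}

query = [0.5, 0.1]
hits = centers_within(query, centroids, max_dist=1.0)
candidate_lists = [center_postings[c] for c in hits]
```

Only centroids 0 and 1 fall within the preset distance of the query, so only their inverted chain data is gathered as vector-side candidates.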
S33, solving an intersection of inverted chain data of each word segment to obtain a first weight value;
meanwhile, the inverted chain data corresponding to each center point is subjected to union calculation to obtain a second weight value.
For example, for several words of "text", "mining", "engine", the inverted chain data of each word is intersected to obtain a first weight value.
And at the same time, at least one center point which is away from the second vector and meets the preset distance is found in the first vector index, inverted chain data corresponding to each center point is obtained to obtain a union set, and a second weight value is obtained.
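A simple way to read steps S33 above, under the assumption that each weight value is the size of the resulting candidate set, is the following sketch; the example posting lists are invented for illustration:

```python
def first_weight(term_postings):
    """Intersection of the per-term posting lists (text-side candidates)."""
    sets = [set(p) for p in term_postings]
    inter = set.intersection(*sets) if sets else set()
    return inter, len(inter)

def second_weight(center_postings):
    """Union of the per-centroid posting lists (vector-side candidates)."""
    union = set().union(*[set(p) for p in center_postings]) if center_postings else set()
    return union, len(union)

term_lists = [[5, 3, 1], [5, 4, 3], [5, 3, 2]]   # e.g. "text", "mining", "engine"
center_lists = [[5, 4], [7, 3]]                   # per center point

inter, w1 = first_weight(term_lists)
union, w2 = second_weight(center_lists)
```

Here the intersection yields first weight value 2 and the union yields second weight value 4; the two values would then be compared as in step S34.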
S34, comparing the first weight value with the second weight value, filtering inverted chain data with large weight value, and storing the inverted chain data into a recall intermediate result data set when preset filtering conditions are met.
And then, repeating the steps S32-S34 until the recall intermediate result data set is stopped to be collected under the condition that the recall intermediate result data set reaches a preset first storage capacity threshold value or the time for searching by using the query text exceeds a preset first time threshold value.
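The termination conditions above (a storage capacity threshold or a retrieval-time threshold) can be sketched as a collection loop; the filter and score functions are placeholders for whatever preset filtering condition the system uses:

```python
import time

def collect_recall(candidates, score_fn, keep, cap, timeout_s):
    """Collect candidates into a recall intermediate result set until a
    capacity threshold or an elapsed-time threshold is reached."""
    recall = []
    start = time.monotonic()
    for cand in candidates:
        if len(recall) >= cap or time.monotonic() - start > timeout_s:
            break
        if keep(cand):                     # the preset filtering condition
            recall.append((cand, score_fn(cand)))
    return recall

result = collect_recall(
    candidates=range(100),
    score_fn=lambda c: -c,                 # hypothetical score: smaller ID is better
    keep=lambda c: c % 2 == 0,             # hypothetical filter: even IDs only
    cap=5,
    timeout_s=1.0,
)
```

With these toy parameters, collection stops as soon as five entries have been stored, even though many candidates remain.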
And S35, sequencing the recall intermediate result data set, determining the third vector, and taking the sub-text corresponding to the third vector as a target text.
In a specific implementation, all the inverted chain data stored in the recall intermediate result data set are sorted by score, and the top-ranked inverted chain data is taken as the third vector.
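The final sort-and-truncate step is straightforward; a sketch with invented scores:

```python
def top_k(recall_set, k):
    """Sort recall entries by score (descending) and keep the top k."""
    return sorted(recall_set, key=lambda item: item[1], reverse=True)[:k]

recall_set = [("doc7", 0.31), ("doc2", 0.92), ("doc5", 0.78), ("doc1", 0.40)]
best = top_k(recall_set, 2)
```

For large recall sets, `heapq.nlargest` would avoid a full sort, but the effect is the same: the highest-scoring entries are intercepted as the result.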
S4, taking the sub text corresponding to the third vector as a target text.
In this embodiment, semantic vector retrieval is combined with text retrieval, and text retrieval is performed based on vector retrieval. By taking the correlation between semantic information into account, the recall capability of semantic vector retrieval in scenarios combined with text retrieval is greatly improved; in particular, in scenarios with data filtering conditions, a comprehensively optimized recall result can be stably produced as the target text.
In addition, the whole retrieval process can be truncated according to the retrieval time and the chain length of each piece of inverted chain data. For the retrieval time, the weight value of each word can be determined from its document ID information, and the recall time of each word is then determined according to the size of its weight value; a word with more document ID information has a larger weight value and requires more recall time.
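One way such truncation might work, under the assumption that each posting list is scanned up to a budget proportional to its term weight (the proportional rule and all values below are invented for illustration), is:

```python
def truncate_postings(postings, weights, base_budget=2):
    """Cap how much of each posting list is scanned: higher-weight terms
    get a larger recall budget (a simple proportional rule, assumed here)."""
    out = {}
    for term, plist in postings.items():
        budget = max(1, int(base_budget * weights.get(term, 1.0)))
        out[term] = plist[:min(budget, len(plist))]
    return out

postings = {"text": [9, 7, 5, 3, 1], "mining": [8, 6, 4], "engine": [2]}
weights = {"text": 2.0, "mining": 1.0, "engine": 0.5}
truncated = truncate_postings(postings, weights)
```

The high-weight term "text" keeps four entries while the low-weight term "engine" is cut to one, bounding both the chain length scanned and the recall time spent per word.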
In addition, to ensure the timeliness of the vector indexes, this embodiment further writes the first vector index and the first inverted index together into a memory segment in advance to construct a temporary memory vector index; then, when the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold, the first vector index and the first inverted index are written together into a disk segment to construct a persistent disk vector index. By combining the small-capacity in-memory cache of the temporary memory vector index with the large-capacity disk persistence of the persistent disk vector index, real-time index construction can be achieved for high-timeliness services, the flexibility and timeliness of vector index construction are improved, the semantic recall effect is guaranteed, and text retrieval accuracy is improved.
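The memory-segment/disk-segment scheme can be sketched as a buffered writer that flushes on a capacity or age threshold. The class name, JSON serialization, and thresholds below are all assumptions made for illustration:

```python
import json
import os
import tempfile
import time

class SegmentedIndexWriter:
    """Buffer index entries in a memory segment; flush the segment to a
    disk file once a capacity or age threshold is crossed."""

    def __init__(self, cap=3, max_age_s=60.0, directory=None):
        self.cap, self.max_age_s = cap, max_age_s
        self.directory = directory or tempfile.mkdtemp()
        self.memory_segment = []        # the temporary memory vector index
        self.created = time.monotonic()
        self.disk_segments = []         # paths of persistent disk segments

    def add(self, entry):
        self.memory_segment.append(entry)
        if (len(self.memory_segment) >= self.cap
                or time.monotonic() - self.created > self.max_age_s):
            self.flush()

    def flush(self):
        if not self.memory_segment:
            return
        path = os.path.join(self.directory, f"segment_{len(self.disk_segments)}.json")
        with open(path, "w") as f:
            json.dump(self.memory_segment, f)
        self.disk_segments.append(path)
        self.memory_segment = []
        self.created = time.monotonic()

w = SegmentedIndexWriter(cap=3)
for i in range(7):
    w.add({"id": i})
```

With a capacity of three, seven additions produce two flushed disk segments while one entry remains in the memory segment, matching the two-tier behavior described above.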
As described above, in this embodiment, the semantic vector search is combined with the text search, the text search is performed based on the vector search, after the user search request is obtained, the query text input by the user is encoded by the deep learning model to obtain the second vector, then the first vector index obtained by splitting the document library to be searched into a plurality of independent sub-texts and then encoding by the deep learning model is utilized, the third vector with the highest similarity to the second vector is queried from the first vector index, and the sub-text corresponding to the third vector is used as the target text. According to the text retrieval method, from the consideration of correlation between semantic information, the semantic recall effect of semantic vector retrieval under a scene combined with text retrieval is greatly improved, and the text retrieval performance and the accuracy of retrieval results are improved.
It should be noted that, the text retrieval method provided in the above embodiment of the present application may be applied to various terminals, such as an upper computer server, a desktop computer, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook, a personal digital assistant (personal digital assistant, PDA), a mobile phone, and other client devices, and the embodiment of the present application is not limited specifically.
Fig. 5 is a schematic block diagram of a text retrieval device 400 according to an embodiment of the present application.
Referring to fig. 5, a block structure of a text retrieval apparatus 400 corresponds to the text retrieval method shown in fig. 3 to 4, and the text retrieval apparatus provided by the embodiment of the present application can implement each process implemented by the above text retrieval method.
As shown in fig. 5, a text retrieval apparatus 400 according to an embodiment of the present application includes:
an obtaining module 401, configured to obtain a user search request;
an encoding module 402, configured to encode a query text input by a user through a deep learning model to obtain a second vector;
a retrieving module 403, configured to query a third vector with highest similarity to the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then coding the split sub-texts by using a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
Optionally, the apparatus further includes:
generating an inverted file of the document library to be retrieved;
and generating a first inverted index of the document library to be retrieved according to the inverted file.
Optionally, the querying, from the first vector index, a third vector having the highest similarity with the second vector includes:
Splitting the query text into a plurality of independent segmentations;
querying inverted chain data corresponding to each word in the first inverted index;
finding at least one center point which is at a preset distance from the second vector in the first vector index, and acquiring inverted chain data corresponding to each center point;
intersection is calculated on inverted chain data of each word segment, and a first weight value is obtained;
obtaining a union set of inverted chain data corresponding to each center point to obtain a second weight value;
comparing the first weight value with the second weight value, filtering out the inverted chain data with the larger weight value, and storing the result in a recall intermediate result data set when a preset filtering condition is met;
and sequencing the recall intermediate result data set, and determining the third vector.
Optionally, the intermediate result data set reaches a preset first storage capacity threshold, or
In case the time retrieved with the query text exceeds a preset first time threshold,
terminating collection of the recall intermediate result data set.
Optionally, the sorting the recall intermediate result data set, determining the third vector, includes:
And sequencing all the inverted chain data stored in the recall intermediate result data set according to the score, and intercepting the inverted chain data ranked at the top as the third vector.
Optionally, the querying, in the first inverted index, the inverted chain data corresponding to each word segment includes:
acquiring document number ID information of each word segment;
and arranging the document number ID information of all the word segments in descending order to form the inverted chain data.
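The descending-ID inverted chain described above is straightforward to sketch (deduplication is an assumption, since the embodiment does not say whether an ID may repeat within a chain):

```python
def make_inverted_chain(doc_ids):
    """Arrange the document-number ID information of a word segment
    in descending order to form its inverted chain."""
    return sorted(set(doc_ids), reverse=True)
```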
Optionally, determining a weight value of each word segment according to its document number ID information;
determining a recall time for each word segment according to the magnitude of its weight value;
and truncating the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
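One way to read the truncation above is as a per-segment scan budget: a higher-weight word segment gets more of its inverted chain scanned. The mapping from weight to budget below is an illustrative assumption, not a formula from the embodiment:

```python
def truncate_chain(chain, weight, base_budget=1000):
    """Truncate an inverted chain according to a recall budget
    derived from the word segment's weight value: higher weight
    means a larger share of the chain is scanned."""
    budget = int(base_budget * weight)
    return chain[:min(budget, len(chain))]
```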
Optionally, the apparatus further includes a storage module 404 configured to:
writing the first vector index and the first inverted index into a memory segment together to construct a temporary memory vector index;
when the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold, writing the first vector index and the first inverted index together into a disk segment to construct a persistent disk vector index.
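The memory-segment-to-disk-segment flow above resembles a log-structured index writer. The sketch below assumes a JSON file per flushed segment and a size-based flush trigger; the actual segment format and the time-based trigger are left out and are assumptions:

```python
import json
import os
import tempfile

class SegmentedIndexWriter:
    """Hold the vector index and inverted index together in an
    in-memory segment, then flush them together to a persistent
    disk segment when a capacity threshold is reached."""

    def __init__(self, max_entries, dir_path):
        self.max_entries = max_entries  # second storage capacity threshold
        self.dir_path = dir_path
        self.memory_segment = {"vectors": {}, "inverted": {}}
        self.flushed = 0  # number of disk segments written

    def add(self, doc_id, vector, terms):
        self.memory_segment["vectors"][doc_id] = vector
        for t in terms:
            self.memory_segment["inverted"].setdefault(t, []).append(doc_id)
        if len(self.memory_segment["vectors"]) >= self.max_entries:
            self.flush()

    def flush(self):
        # Write both indexes together into one disk segment file.
        path = os.path.join(self.dir_path, f"segment_{self.flushed}.json")
        with open(path, "w") as f:
            json.dump(self.memory_segment, f)
        self.memory_segment = {"vectors": {}, "inverted": {}}
        self.flushed += 1

# Illustrative usage with a hypothetical two-document batch.
d = tempfile.mkdtemp()
w = SegmentedIndexWriter(max_entries=2, dir_path=d)
w.add("a", [0.1], ["x"])
w.add("b", [0.2], ["x", "y"])
```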
Therefore, the text retrieval device 400 according to the embodiment of the present application combines semantic vector retrieval with text retrieval and performs text retrieval on the basis of vector retrieval. After a user search request is obtained, the query text input by the user is encoded by a deep learning model to obtain a second vector; a third vector with the highest similarity to the second vector is then queried from the first vector index, which is obtained by splitting the document library to be retrieved into a plurality of independent sub-texts and encoding them with the deep learning model; and the sub-text corresponding to the third vector is taken as the target text. By taking the correlation between semantic information into account, this greatly improves the semantic recall effect of semantic vector retrieval in a scene combined with text retrieval, and improves both text retrieval performance and the accuracy of retrieval results.
It should be appreciated that the descriptions of the text retrieval method described above are equally applicable to the text retrieval apparatus 400 according to the embodiment of the present application, and a detailed description is not given for the sake of avoiding repetition.
Further, it should be understood that in the text retrieving device 400 according to the embodiment of the present application, only the above-described division of the respective functional modules is illustrated, and in practical applications, the above-described functional allocation may be performed by different functional modules as needed, that is, the text retrieving device may be divided into functional modules different from the above-described illustrated modules to perform all or part of the above-described functions.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
As shown in fig. 6, an embodiment of the present application further provides an electronic device 500, including a processor 501, a memory 502, and a program or instruction stored in the memory 502 and executable on the processor 501. When executed by the processor 501, the program or instruction implements the steps of the text retrieval method described above and achieves the same technical effects.
Therefore, according to the electronic device 500 of the embodiment of the application, the semantic vector search is combined with the text search, so that the semantic recall effect of the semantic vector search in a scene combined with the text search is greatly improved, and the text search performance and the accuracy of the search result are improved.
Other technical effects of the electronic device 500 according to the embodiment of the present application are not described in detail herein to avoid repetition.
It should be noted that, the electronic device in the embodiment of the present application may include a mobile electronic device and a non-mobile electronic device.
Fig. 7 is a schematic diagram of a specific hardware structure of an electronic device according to an embodiment of the present application.
Referring to fig. 7, an electronic device 600 includes, but is not limited to: radio frequency unit 601, network module 602, audio output unit 603, input unit 604, sensor 605, display unit 606, user input unit 607, interface unit 608, memory 609, and processor 610.
It should be understood that, in the embodiment of the present application, the radio frequency unit 601 may be used to receive and send information or signals during a call; specifically, it receives downlink data from a base station and passes the downlink data to the processor 610 for processing, and it transmits uplink data to the base station. Typically, the radio frequency unit 601 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 601 may also communicate with networks and other devices through a wireless communication system.
The electronic device 600 provides wireless broadband internet access to users, such as helping users send and receive e-mail, browse web pages, and access streaming media, through the network module 602.
The audio output unit 603 may convert audio data received by the radio frequency unit 601 or the network module 602 or stored in the memory 609 into an audio signal and output as sound. Also, the audio output unit 603 may also provide audio output (e.g., a call signal reception sound, a message reception sound, etc.) related to a specific function performed by the electronic device 600. The audio output unit 603 includes a speaker, a buzzer, a receiver, and the like.
The input unit 604 is used for receiving audio or video signals. It should be understood that in an embodiment of the present application, the input unit 604 may include a graphics processor (Graphics Processing Unit, GPU) 6041 and a microphone 6042, and the graphics processor 6041 processes image data of still pictures or video obtained by an image capturing apparatus (e.g., a camera) in a video capturing mode or an image capturing mode.
The electronic device 600 also includes at least one sensor 605, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 6061 according to the brightness of ambient light, and the proximity sensor can turn off the display panel 6061 and/or the backlight when the electronic device 600 moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for recognizing the gesture of the electronic equipment (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; the sensor 605 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not described herein.
The display unit 606 is used to display information input by a user or information provided to the user. The display unit 606 may include a display panel 6061, and the display panel 6061 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 607 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. Specifically, the user input unit 607 includes a touch panel 6071 and other input devices 6072. The touch panel 6071, also referred to as a touch screen, may collect touch operations on or near it by a user (e.g., operations by the user on or near the touch panel 6071 using a finger, a stylus, or any other suitable object or accessory). The touch panel 6071 may include two parts: a touch detection device and a touch controller. The other input devices 6072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The interface unit 608 is an interface for connecting an external device to the electronic apparatus 600. For example, the external device may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 608 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic apparatus 600, or may be used to transmit data between the electronic apparatus 600 and an external device.
The memory 609 may be used to store software programs as well as various data. The memory 609 may mainly include a program storage area, which may store an operating system and the application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a data storage area, which may store data created according to the use of the device (such as audio data, a phonebook, etc.). In addition, the memory 609 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 610 is the control center of the electronic device. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 609 and calling data stored in the memory 609, thereby monitoring the electronic device as a whole. The processor 610 may include one or more processing units; preferably, the processor 610 may integrate an application processor, which primarily handles the operating system, user interfaces, applications, and the like, and a modem processor, which primarily handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 610. Those skilled in the art will appreciate that the electronic device 600 may further include a power source (e.g., a battery) for powering the various components, which may be logically connected to the processor 610 through a power management system, so that charging, discharging, and power consumption are managed through the power management system. The electronic device structure shown in fig. 7 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than shown, combine certain components, or have a different arrangement of components, which are not described in detail herein. In an embodiment of the present application, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device (e.g., a bracelet or glasses), a pedometer, and the like.
In particular, the processor 610 is configured to:
acquiring a user search request;
encoding the query text input by the user through a deep learning model to obtain a second vector;
querying a third vector with the highest similarity to the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding the split sub-texts with a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
Therefore, according to the electronic device 600 of the embodiment of the application, the semantic vector search is combined with the text search, so that the semantic recall effect of the semantic vector search in a scene combined with the text search is greatly improved, and the text search performance and the accuracy of the search result are improved.
The embodiment of the application also provides a readable storage medium storing a program or instructions which, when executed by a processor, implement the steps of the text retrieval method described above and achieve the same technical effects. Therefore, according to the readable storage medium of the embodiment of the application, combining semantic vector retrieval with text retrieval greatly improves the semantic recall effect of semantic vector retrieval in a scene combined with text retrieval, and improves both text retrieval performance and the accuracy of retrieval results.
Other technical effects of the readable storage medium according to the embodiments of the present application are not repeated here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.
The embodiment of the application also provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running programs or instructions to realize the steps of the text retrieval method and achieve the same technical effect.
Therefore, according to the chip provided by the embodiment of the application, the semantic vector search is combined with the text search, so that the semantic recall effect of the semantic vector search in a scene combined with the text search is greatly improved, and the text search performance and the accuracy of the search result are improved.
For other technical effects of the chip according to the embodiments of the present application, in order to avoid repetition, a description is omitted here.
It should be understood that the chips referred to in the embodiments of the present application may also be referred to as system-level chips, system chips, chip systems, or system-on-chip chips.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; depending on the functions involved, they may also be performed in a substantially simultaneous manner or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform; they may of course also be implemented by hardware, but in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a computer software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) and comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may make many variations without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.
The application discloses A1. A text retrieval method, comprising the following steps:
acquiring a user search request;
encoding the query text input by the user through a deep learning model to obtain a second vector;
querying a third vector with the highest similarity to the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding the split sub-texts with a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
A2. The method according to A1, further comprising:
generating an inverted file of the document library to be retrieved;
and generating a first inverted index of the document library to be retrieved according to the inverted file.
A3. The method of A2, wherein the querying, from the first vector index, a third vector with the highest similarity to the second vector includes:
splitting the query text into a plurality of independent word segments;
querying, in the first inverted index, the inverted chain data corresponding to each word segment;
finding, in the first vector index, at least one center point within a preset distance of the second vector, and acquiring the inverted chain data corresponding to each center point;
calculating an intersection of the inverted chain data of the word segments to obtain a first weight value;
calculating a union of the inverted chain data corresponding to the center points to obtain a second weight value;
comparing the first weight value with the second weight value, filtering out the inverted chain data with the larger weight value, and storing the remaining inverted chain data in a recall intermediate result data set when a preset filtering condition is met;
and sorting the recall intermediate result data set to determine the third vector.
A4. The method of A3, wherein when the recall intermediate result data set reaches a preset first storage capacity threshold, or when the time spent retrieving with the query text exceeds a preset first time threshold, collection of the recall intermediate result data set is terminated.
A5. The method of A3, wherein the sorting the recall intermediate result data set to determine the third vector includes:
sorting all the inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.
A6. The method of A3, wherein the querying, in the first inverted index, the inverted chain data corresponding to each word segment includes:
acquiring document number ID information of each word segment;
and arranging the document number ID information of all the word segments in descending order to form the inverted chain data.
A7. The method of A6, further comprising:
determining a weight value of each word segment according to its document number ID information;
determining a recall time for each word segment according to the magnitude of its weight value;
and truncating the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
A8. The method according to A2, further comprising:
writing the first vector index and the first inverted index into a memory segment together to construct a temporary memory vector index;
when the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold, writing the first vector index and the first inverted index together into a disk segment to construct a persistent disk vector index.
The application also discloses B9. A text retrieval apparatus, comprising:
the acquisition module is used for acquiring a user search request;
the coding module is used for coding the query text input by the user through the deep learning model to obtain a second vector;
a retrieval module for querying a third vector with the highest similarity to the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding the split sub-texts with a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
B10. The apparatus of B9, further comprising an index building module configured to:
generating an inverted file of the document library to be retrieved;
and generating a first inverted index of the document library to be retrieved according to the inverted file.
B11. The apparatus of B10, wherein the querying, from the first vector index, a third vector with the highest similarity to the second vector includes:
splitting the query text into a plurality of independent word segments;
querying, in the first inverted index, the inverted chain data corresponding to each word segment;
finding, in the first vector index, at least one center point within a preset distance of the second vector, and acquiring the inverted chain data corresponding to each center point;
calculating an intersection of the inverted chain data of the word segments to obtain a first weight value;
calculating a union of the inverted chain data corresponding to the center points to obtain a second weight value;
comparing the first weight value with the second weight value, filtering out the inverted chain data with the larger weight value, and storing the remaining inverted chain data in a recall intermediate result data set when a preset filtering condition is met;
and sorting the recall intermediate result data set to determine the third vector.
B12. The apparatus of B11, wherein when the recall intermediate result data set reaches a preset first storage capacity threshold, or when the time spent retrieving with the query text exceeds a preset first time threshold, collection of the recall intermediate result data set is terminated.
B13. The apparatus of B11, wherein the sorting the recall intermediate result data set to determine the third vector includes:
sorting all the inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.
B14. The apparatus of B11, wherein the querying, in the first inverted index, the inverted chain data corresponding to each word segment includes:
acquiring document number ID information of each word segment;
and arranging the document number ID information of all the word segments in descending order to form the inverted chain data.
B15. The apparatus of B14, further configured to:
determine a weight value of each word segment according to its document number ID information;
determine a recall time for each word segment according to the magnitude of its weight value;
and truncate the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
B16. The apparatus of B10, further comprising a storage module configured to:
writing the first vector index and the first inverted index into a memory segment together to construct a temporary memory vector index;
when the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold, write the first vector index and the first inverted index together into a disk segment to construct a persistent disk vector index.
The application also discloses an electronic device comprising a processor, a memory and a program or instructions stored on the memory and executable on the processor, which when executed by the processor, implement the steps of the text retrieval method according to any one of A1-A8.
The application also discloses a readable storage medium having stored thereon a program or instructions which when executed by a processor, implements the steps of the text retrieval method according to any of A1-A8.

Claims (10)

1. A text retrieval method, comprising:
acquiring a user search request;
encoding the query text input by the user through a deep learning model to obtain a second vector;
querying a third vector with the highest similarity to the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding the split sub-texts with a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
2. The method of claim 1, wherein the querying, from the first vector index, a third vector with the highest similarity to the second vector comprises:
splitting the query text into a plurality of independent word segments;
querying, in a first inverted index, the inverted chain data corresponding to each word segment, wherein the first inverted index is generated from an inverted file of the document library to be retrieved;
finding, in the first vector index, at least one center point within a preset distance of the second vector, and acquiring the inverted chain data corresponding to each center point;
calculating an intersection of the inverted chain data of the word segments to obtain a first weight value;
calculating a union of the inverted chain data corresponding to the center points to obtain a second weight value;
comparing the first weight value with the second weight value, filtering out the inverted chain data with the larger weight value, and storing the remaining inverted chain data in a recall intermediate result data set when a preset filtering condition is met;
and sorting the recall intermediate result data set to determine the third vector.
3. The method of claim 2, wherein when the recall intermediate result data set reaches a preset first storage capacity threshold, or when the time spent retrieving with the query text exceeds a preset first time threshold, collection of the recall intermediate result data set is terminated.
4. The method of claim 2, wherein the sorting the recall intermediate result data set to determine the third vector comprises:
sorting all the inverted chain data stored in the recall intermediate result data set by score, and taking the top-ranked inverted chain data as the third vector.
5. The method of claim 2, wherein the querying, in the first inverted index, the inverted chain data corresponding to each word segment comprises:
acquiring document number ID information of each word segment;
and arranging the document number ID information of all the word segments in descending order to form the inverted chain data.
6. The method of claim 5, further comprising:
determining a weight value of each word segment according to its document number ID information;
determining a recall time for each word segment according to the magnitude of its weight value;
and truncating the retrieval process according to the recall time and the chain length of each piece of inverted chain data.
7. The method as recited in claim 2, further comprising:
writing the first vector index and the first inverted index into a memory segment together to construct a temporary memory vector index;
when the temporary memory vector index reaches a preset second storage capacity threshold, or the time for constructing the temporary memory vector index reaches a preset second time threshold,
and writing the first vector index and the first inverted index into a disk segment together to construct a persistent disk vector index.
8. A text retrieval apparatus, comprising:
the acquisition module is used for acquiring a user search request;
the coding module is used for coding the query text input by the user through the deep learning model to obtain a second vector;
a retrieval module for querying a third vector with the highest similarity to the second vector from the first vector index,
the first vector index is obtained by splitting a document library to be retrieved into a plurality of independent sub-texts and then encoding the split sub-texts with a deep learning model;
and taking the sub text corresponding to the third vector as a target text.
9. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, the program or instruction when executed by the processor implementing the steps of the text retrieval method of any of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the text retrieval method according to any of claims 1-7.
CN202210541779.2A 2022-05-17 2022-05-17 Text retrieval method, text retrieval device, electronic equipment and storage medium Pending CN117149990A (en)
Priority application: CN202210541779.2A, filed 2022-05-17 (CN).
Publication: CN117149990A, published 2023-12-01; legal status: pending.
Family ID: 88910578.
