CN114661852A - Text searching method, terminal and readable storage medium - Google Patents

Text searching method, terminal and readable storage medium

Publication number
CN114661852A
Authority
CN
China
Prior art keywords
vocabulary
text
searched
words
specific
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011544265.XA
Other languages
Chinese (zh)
Inventor
杰·戈亚尔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oneplus Technology Shenzhen Co Ltd
Original Assignee
Oneplus Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oneplus Technology Shenzhen Co Ltd filed Critical Oneplus Technology Shenzhen Co Ltd
Priority to CN202011544265.XA priority Critical patent/CN114661852A/en
Publication of CN114661852A publication Critical patent/CN114661852A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/313Selection or weighting of terms for indexing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3338Query expansion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text searching method, a terminal and a readable storage medium. The method comprises: obtaining a vocabulary to be searched; obtaining associated words that are semantically and/or grammatically related to the vocabulary to be searched, the vocabulary to be searched and its associated words together forming the keywords of the vocabulary to be searched; and searching each text in a text library by using the keywords to obtain the texts containing at least one keyword.

Description

Text search method, terminal and readable storage medium
Technical Field
The application relates to the technical field of computers, in particular to a text searching method, a terminal and a readable storage medium.
Background
With the continuous expansion of terminal storage space, more and more data is stored, and it is difficult for a user to find target data in a large amount of data by checking items one by one. Because data is commonly stored in text form, it has been proposed that the user enter a vocabulary to be searched in a search bar, and that the texts containing the vocabulary to be searched be screened out and pushed to the user. The target data is then obtained by searching the texts, which improves the probability that the user obtains the target data from the terminal.
In the traditional searching method, after the vocabulary input by the user is obtained, the searching is usually carried out according to the vocabulary to be searched, and if each text in the terminal does not contain the vocabulary to be searched, the target data cannot be obtained.
Disclosure of Invention
Based on the above, the application provides a text search method and device, a terminal and a storage medium, which can improve the probability of searching target data.
In a first aspect, a text search method is provided, which includes the following steps:
acquiring a vocabulary to be searched;
acquiring related words related to semantics and/or grammar of the words to be searched, wherein the words to be searched and the related words are keywords of the words to be searched;
and searching in each text in a text library by using the keywords, and acquiring the text containing at least one keyword.
In one embodiment, the text includes the file name of each file in the terminal, and the step of searching in each text by using the keywords of the vocabulary to be searched includes: acquiring each file in the terminal, acquiring the file name of each file, searching in each file name by using the keywords of the vocabulary to be searched, and acquiring the file name containing at least one keyword and the corresponding file.
In one embodiment, the text includes the text in each file in the terminal, and the step of searching in each text by using the keywords of the vocabulary to be searched includes: acquiring each file in the terminal, acquiring the text in each file, searching the text in each file by using the keywords of the vocabulary to be searched, and acquiring the text containing at least one keyword and the corresponding file.
In one embodiment, after the step of obtaining the text containing at least one keyword, the method includes:
and displaying the text containing at least one keyword on a display interface, and marking the appearing keyword.
In one embodiment, the method further comprises the following steps of creating a matrix table:
acquiring an initial matrix table constructed for a specific vocabulary; each specific vocabulary and associated vocabulary thereof in the initial matrix table have a pairing mapping relation, and the associated vocabulary corresponding to the specific vocabulary lacking the associated vocabulary is empty;
counting the number of specific vocabularies with associated vocabularies in the matrix table;
under the condition that the number is less than a preset number value, for a specific vocabulary lacking related vocabularies in the matrix table, obtaining the related vocabularies corresponding to the specific vocabulary by using a word embedding model, and writing the obtained related vocabularies into the matrix table until the number of the specific vocabulary with related vocabularies reaches the preset number;
the step of obtaining the associated vocabulary related to the semantic and/or grammar of the vocabulary to be searched comprises the following steps:
and acquiring the associated vocabulary from the matrix table according to the vocabulary to be searched.
In one embodiment, the step of creating the matrix table is performed in an off-line manner.
In one embodiment, the step of obtaining the associated vocabulary of each specific vocabulary by using the word embedding model includes:
obtaining a corpus;
searching the position where each specific vocabulary appears from each document in the corpus, and acquiring context words in a preset window of the specific vocabulary at each position;
and determining the weight of each context word, and outputting the context words with the weight larger than the preset weight as related words related to the semantics/grammar of the specific words.
In one embodiment, the method further comprises the following steps:
acquiring documents in at least two terminals;
searching each document for the positions where the specific vocabulary appears, and, for each of several different test windows, acquiring the context words within that test window of the specific vocabulary at each position;
determining the weight of each context word of each specific vocabulary in different test windows, and outputting the context word with the weight greater than the preset weight as a related vocabulary related to the semantics/grammar of the specific vocabulary;
and comparing the number of the associated vocabularies of each specific vocabulary in different test windows, and the association degree of each associated vocabulary and the specific vocabulary in different test windows, and taking the test window with the association degree higher than a preset threshold value and/or the number of the associated vocabularies higher than the preset number as the preset window.
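The window selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: documents are already tokenized, the weight of a context word is its raw co-occurrence count, only the "number of associated vocabularies" criterion is applied (the association-degree criterion is left out), and the smallest qualifying window is returned. The names `count_associated` and `choose_window` are hypothetical.

```python
from collections import Counter
from typing import List, Optional

def count_associated(docs: List[List[str]], word: str,
                     window: int, min_weight: int) -> int:
    """Number of context words whose count within the window exceeds min_weight."""
    counts: Counter = Counter()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == word:
                counts.update(tokens[max(0, i - window):i])      # left context
                counts.update(tokens[i + 1:i + 1 + window])      # right context
    return sum(1 for c in counts.values() if c > min_weight)

def choose_window(docs: List[List[str]], word: str, test_windows: List[int],
                  min_weight: int, min_count: int) -> Optional[int]:
    """Return the smallest test window that yields more than min_count associated words."""
    for window in sorted(test_windows):
        if count_associated(docs, word, window, min_weight) > min_count:
            return window
    return None  # no test window qualifies

docs = [["the", "fast", "brown", "fox", "jumps", "high"],
        ["a", "fast", "brown", "fox", "jumps", "far"]]
preset_window = choose_window(docs, "fox", [1, 2, 3], min_weight=1, min_count=2)
```

With this toy corpus, a window of 1 yields only two associated words, so a window of 2 is the first that satisfies the count criterion.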
In one embodiment, the step of searching each text in the text library by using the keyword includes:
searching each text in the text library by using the associated vocabulary when no text is found by using the vocabulary to be searched, when the number of texts found is less than a preset number, or when it is detected that the texts found by using the vocabulary to be searched do not belong to the target text.
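A minimal sketch of this fallback behaviour, assuming plain case-insensitive substring matching. The function name `search_with_fallback` and the parameter `min_hits` (standing in for the preset number) are hypothetical, and the third condition (results that do not belong to the target text) is not modelled because it depends on user feedback.

```python
from typing import List

def search_with_fallback(query_word: str, associated: List[str],
                         texts: List[str], min_hits: int = 1) -> List[str]:
    """Search with the query word first; use associated words only if too few hits."""
    q = query_word.lower()
    hits = [t for t in texts if q in t.lower()]
    if len(hits) >= min_hits:
        return hits  # the query word alone found enough texts
    # Fallback: add texts that contain at least one associated word.
    extra = [t for t in texts
             if t not in hits and any(a.lower() in t.lower() for a in associated)]
    return hits + extra

found = search_with_fallback("train ticket", ["railway"], ["my railway.pdf", "notes"])
```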
In a second aspect, a terminal is provided. The terminal comprises a memory and a processor, wherein the memory stores a computer program, and the computer program, when executed by the processor, implements the steps of the method described in any of the above embodiments.
In a third aspect, a readable storage medium is proposed, wherein a computer program is stored on the readable storage medium, and the computer program, when executed by a processor, implements the steps of the method as described in any of the above embodiments.
According to the text searching method, the terminal and the readable storage medium, associated vocabulary related to the semantics and/or grammar of the vocabulary to be searched is obtained, so that the keywords of the vocabulary to be searched comprise the vocabulary to be searched and its associated vocabulary; each text is then searched by using the keywords, and the texts containing at least one keyword are obtained. The search range is thereby expanded, and the probability of obtaining the target data is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a terminal in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a text search method in an embodiment of the present application;
Fig. 3 is a diagram illustrating the effect of a text search method in an embodiment of the present application;
Fig. 4 is a diagram illustrating the effect of locating an associated paragraph in an embodiment of the present application;
Fig. 5 is a diagram illustrating the effect of a text search method in another embodiment of the present application;
Fig. 6 is a diagram illustrating the effect of marking keywords in a displayed file name in an embodiment of the present application;
Fig. 7 is a diagram illustrating the effect of marking keywords in a displayed sentence segment of a document in another embodiment of the present application;
Fig. 8 is a diagram illustrating the training of context words using a word embedding model in an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a search apparatus in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The following embodiments and their technical features may be combined with each other without conflict.
Please refer to fig. 1, which is a schematic diagram illustrating the internal structure of a terminal according to an embodiment of the present application. The terminal includes a processor, a memory, and a network interface connected by a system bus. The processor provides computing and control capability and supports the operation of the whole terminal. The memory stores data, programs and the like; at least one computer program is stored in the memory and can be executed by the processor to implement the text search method for the terminal provided by the embodiments of the application. The memory may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by the processor to implement the text search method provided in the following embodiments. The internal memory provides a cached execution environment for the operating system and the computer programs in the non-volatile storage medium. The network interface may be an Ethernet card or a wireless network card, and is used for communicating with an external terminal.
Those skilled in the art will appreciate that the terminal configuration shown in fig. 1 is not intended to be limiting, and that the terminal may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The terminal described in the present application may include mobile terminals such as a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a smart band, a pedometer, and the like, and fixed terminals such as a Digital TV, a desktop computer, and the like.
The following description will be given taking a mobile terminal as an example, and it will be understood by those skilled in the art that the configuration according to the embodiment of the present application can be applied to a fixed type terminal in addition to elements particularly used for mobile purposes.
In the related art, after words input by a user are obtained, searching is usually performed according to the words to be searched, and if each text in a terminal does not contain the words to be searched, target data cannot be searched.
The embodiment of the application provides a text searching method which can improve the probability of searching target data. The target data may be a target file, a target sentence segment, etc., such as a sentence or segment within a document, an instant message, etc. When the text search method in the embodiment of the present application is used to obtain data buffered or stored in the terminal, the text search method may be executed in an offline manner. The text search method of the present application is described below by taking a smart phone as an example.
Please refer to fig. 2, which is a flowchart illustrating a text search method according to an embodiment of the present application, where the text search method includes the following steps:
step 202, acquiring a vocabulary to be searched;
step 204, obtaining related words related to the semantic and/or grammar of the words to be searched, and taking the words to be searched and the related words as keywords of the words to be searched;
step 206, each text in the text library is searched by using the keywords, and a text containing at least one keyword is obtained.
According to the text searching method in the embodiment of the application, associated words related to the words to be searched in semantic meaning and/or grammar are obtained, then the keywords of the words to be searched comprise the words to be searched and the associated words, and then the texts containing at least one keyword are obtained by searching in each text by utilizing each keyword of the words to be searched, so that the searching range can be expanded, and the probability of obtaining target data is improved.
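Steps 202 to 206 can be sketched as follows. This is a minimal illustration, assuming the associated words are already available in a lookup table and that matching is a case-insensitive substring test; the table contents and the names `ASSOCIATED`, `expand_keywords` and `search_texts` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical associated-word table; the entry is illustrative only.
ASSOCIATED: Dict[str, List[str]] = {
    "train ticket": ["railway"],
}

def expand_keywords(query_word: str, associated: Dict[str, List[str]]) -> Set[str]:
    """Step 204: the keywords are the query word plus its associated words."""
    return {query_word, *associated.get(query_word, [])}

def search_texts(query_word: str, texts: List[str],
                 associated: Dict[str, List[str]]) -> List[str]:
    """Step 206: keep every text that contains at least one keyword."""
    keywords = {k.lower() for k in expand_keywords(query_word, associated)}
    return [t for t in texts if any(k in t.lower() for k in keywords)]

texts = ["my railway.pdf", "holiday photos", "train ticket receipt"]
hits = search_texts("train ticket", texts, ASSOCIATED)
```

Without the associated-word table, only "train ticket receipt" would be found; with it, "my railway.pdf" is returned as well.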
The vocabulary to be searched is the vocabulary input by the user to the terminal, and includes at least one character or word, such as dog, fox, green, or tractor. Input modes include, but are not limited to, handwriting input, input via an input method, and voice input. The input may consist of characters, words and/or phrases. One or more words may be input; inputting more words to be searched helps expand the search range.
In some embodiments, when there are a plurality of words to be searched, only the associated words of those words to be searched that have characterization capability may be obtained for searching, rather than the associated words of all the words to be searched, which reduces the probability of search noise. For example, as shown in fig. 3, a user who wants to view his or her own train ticket information inputs "my train ticket" in the search bar; the search is then performed with the keywords of "train ticket", the word with characterization capability, so that search noise is effectively reduced.
In addition, in some embodiments, when there are a plurality of words to be searched, a text is treated as a valid text and acquired only when a keyword of each word to be searched appears in that text, which further reduces the probability of search noise. Further, the text may be acquired only when the distance between the keywords of the respective words to be searched, for example the length in bytes of the interval between them, is smaller than a preset value, and/or when the order in which the keywords appear is detected to be consistent with the input order of the corresponding words to be searched; this reduces search noise still further. For example, as shown in fig. 4, two words to be searched, "life" and "precious", are input in the search bar. With the text search method of the present application, the associated words of "life" and of "precious" are searched for; such keywords are detected in a paragraph, as shown in fig. 5, and it is further detected that they are close to each other.
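The filtering rules for multiple words to be searched might look like the sketch below. It assumes lowercase keywords, measures distance in characters rather than bytes, and only considers the first occurrence of each keyword group; the associated words "existence" and "valuable" and the names `first_hit` and `is_valid_text` are hypothetical.

```python
from typing import List, Optional, Set

def first_hit(text: str, keywords: Set[str]) -> Optional[int]:
    """Return the earliest character offset of any keyword, or None if absent."""
    positions = [text.find(k) for k in keywords if k in text]
    return min(positions) if positions else None

def is_valid_text(text: str, keyword_sets: List[Set[str]],
                  max_gap: int, keep_order: bool = True) -> bool:
    """keyword_sets[i] holds the keywords of the i-th query word, in input order."""
    text = text.lower()
    hits = [first_hit(text, ks) for ks in keyword_sets]
    if any(h is None for h in hits):
        return False  # some query word has no keyword in this text
    if keep_order and hits != sorted(hits):
        return False  # keywords appear in a different order than typed
    return max(hits) - min(hits) < max_gap  # keywords must be close together

keyword_sets = [{"life", "existence"}, {"precious", "valuable"}]
ok = is_valid_text("Existence is valuable.", keyword_sets, max_gap=30)
```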
The text search method of the embodiment of the application is used for obtaining target data, and may specifically be used for obtaining a target file. The text may include the file name of each file and/or the characters in the file, and may also be an instant message such as an email, a WeChat message, or a short message. The file may be a document or a picture, and is not limited herein. Documents may be editable or non-editable; both remain within the scope of the present application. When the file is a document or a photograph containing words, the text of the file may include the words within the document or photograph. The document may be a Portable Document Format document (PDF document), a presentation document (PowerPoint, PPT document), a Word document, a text document (TXT document), and the like, and is not specifically limited herein. The characters may be in any language, such as Chinese, English, Japanese, Korean, or Latin, and are not limited herein.
The text library may refer to a folder of the terminal, which includes a plurality of files therein, wherein at least one file contains text.
In some possible embodiments, when the text includes a file name of each file in the terminal, step 206 may include: and acquiring each file in the terminal, acquiring the file name of each file in the terminal, searching in each file name by using the keywords of the vocabulary to be searched, and acquiring the file name containing at least one keyword and the corresponding file.
For example, fig. 3 is a schematic diagram of the effect of searching for a file according to an embodiment. In this embodiment, the user needs to check his or her own train ticket information and therefore inputs "train ticket" in the search bar. Because the terminal has no file named "train ticket", the train ticket cannot be found by the conventional text search method. With the text search method of the present application, however, "railway" is an associated word of "train ticket", so the pdf file named "railway" is found, and that file is the train ticket the user needs.
In other possible embodiments, when the respective text includes text in a respective file in the terminal, step 206 may include: and acquiring each file of the terminal, acquiring texts in each file in the terminal, searching the texts in each file by using the keywords of the vocabulary to be searched, and acquiring the texts containing at least one keyword and the corresponding files.
For example, as shown in fig. 5, in this embodiment the user needs to find a file but does not remember the file name. The user remembers that a certain sentence or fragment in the file says something like "...human beings are precious", but does not remember the actual wording used in the file, and therefore inputs "human beings" and "precious" in the search bar. With the text search method of the present application, the paragraph shown in fig. 4, which contains the associated words "human life" - "precious" and "dying" - "precious", is found, and finally the pdf document corresponding to the file "meaning of life" is obtained, as shown in fig. 5.
In other possible embodiments, when the text includes both the file name of each file in the terminal and the text in the file, and when the text and/or the file name in the file contain at least one keyword of a vocabulary to be searched, the corresponding file is acquired.
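The three variants (file names only, file contents only, or both) can be sketched with a single function. This is a minimal sketch that assumes the text of each file has already been extracted into a string; `search_files` and its flags `in_names` and `in_content` are hypothetical names.

```python
from typing import Dict, List, Set

def search_files(files: Dict[str, str], keywords: Set[str],
                 in_names: bool = True, in_content: bool = True) -> List[str]:
    """files maps a file name to its extracted text; returns matching file names."""
    kws = {k.lower() for k in keywords}
    matches = []
    for name, content in files.items():
        name_hit = in_names and any(k in name.lower() for k in kws)
        content_hit = in_content and any(k in content.lower() for k in kws)
        if name_hit or content_hit:  # at least one keyword in name and/or content
            matches.append(name)
    return matches

files = {
    "my railway.pdf": "Boarding pass",
    "notes.txt": "Buy a train ticket tomorrow",
    "photo.jpg": "",
}
both = search_files(files, {"train ticket", "railway"})
```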
In some embodiments, when the text search method of the embodiment of the present application is used to obtain a target document, the document may be obtained by an inverted index method or the like. Specifically, texts in each document in the terminal can be obtained, an inverted index method is adopted, each keyword of the vocabulary to be searched is utilized to search the texts in each document, and a text containing at least one keyword and a corresponding document are obtained.
The inverted index, also often referred to as a postings file or an inverted file, is an indexing method. In such embodiments, the inverted index stores, for full-text search, a mapping from each keyword to its storage locations in a document or a group of documents; the mapping may take the form of a table. With the inverted index method, the target documents containing the keywords can be acquired quickly. For example, Table 1 shows the mapping between the keywords of the vocabulary to be searched and the documents, as looked up by the inverted index method.
TABLE 1
Keyword | Numbers of the documents containing the keyword
House | 221 (number of document 1), 231 (number of document 2)
Work | 231
Green | 221, 223 (number of document 3)
Switch | 231
Queen | 223
... | ...
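A minimal inverted-index sketch that reproduces the keyword-to-document mapping of Table 1. The document texts are invented so that the resulting index matches the table, tokenization is a simple whitespace split, and the names `build_inverted_index` and `lookup` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercase word to the set of document numbers that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index: Dict[str, Set[int]], keywords: Iterable[str]) -> Set[int]:
    """Return the documents that contain at least one of the keywords."""
    result: Set[int] = set()
    for k in keywords:
        result |= index.get(k.lower(), set())
    return result

# Invented document contents, numbered as in Table 1.
docs = {221: "green house", 231: "house work switch", 223: "green queen"}
index = build_inverted_index(docs)
found = lookup(index, ["queen", "work"])
```

Looking up a keyword is then a dictionary access rather than a scan of every document, which is why the method acquires target documents quickly.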
After the step of obtaining the text containing at least one keyword in step 206, the method may include the steps of displaying the text containing at least one keyword on a display interface, and marking the appearing keyword. The searched file name and/or text in the file containing at least one keyword and the corresponding file icon can be displayed on a display interface, so that the user interaction is facilitated. The manner of marking may be at least one of bolding, underlining, changing color, changing font, etc.
For example, still taking the train ticket as an example, as shown in fig. 6, the file name "my railway" and the file icon are shown on the display interface; only "railway", the keyword of "train ticket", is marked, and the other words are not marked.
For another example, still taking the "meaning of life" document as an example, as shown in fig. 7, the sentence segment of the document that contains the keywords is shown on the display interface; only the keywords are marked, and the other words are not marked.
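Marking the keywords in a displayed file name or sentence can be sketched as follows. Square brackets stand in for the visual marking (bold, underline, color, or font change); the function name `mark_keywords` and the marker parameters are hypothetical.

```python
import re
from typing import Iterable

def mark_keywords(text: str, keywords: Iterable[str],
                  left: str = "[", right: str = "]") -> str:
    """Wrap every keyword occurrence in markers; other words stay unmarked."""
    # Longest first, so that a longer keyword wins over a keyword it contains.
    ordered = sorted(set(keywords), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"{left}{m.group(0)}{right}", text)

marked = mark_keywords("my railway", ["train ticket", "railway"])
```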
Specifically, when a certain sentence or segment in the file is searched for by the keyword, the position of the keyword may be further acquired when the keyword is searched for, and then the position is skipped to and displayed.
In some possible embodiments, when there are a plurality of texts containing at least one keyword of the vocabulary to be searched, the texts are sorted on the display interface. The texts may be sorted by the degree of association between their keywords and the vocabulary to be searched: the higher the degree of association of a keyword, the nearer the front the corresponding text is placed. Additionally or alternatively, the number of keywords of the vocabulary to be searched contained in each text may be counted, and the more keywords a text contains, the nearer the front it is placed. Alternatively, the number of times the keywords of the vocabulary to be searched occur in each text may be counted, and the more occurrences a text has, the nearer the front it is placed.
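One possible way to combine these sorting criteria is sketched below. Chaining the three criteria into a single sort key is an assumption of this sketch, not something the application prescribes; the association scores are invented, keywords are assumed to be lowercase, and `rank_texts` is a hypothetical name.

```python
from typing import Dict, List, Tuple

def rank_texts(texts: List[str], keyword_scores: Dict[str, float]) -> List[str]:
    """Sort texts: best association first, then distinct keywords, then total hits."""
    def sort_key(text: str) -> Tuple[float, int, int]:
        lowered = text.lower()
        present = [k for k in keyword_scores if k in lowered]
        best = max((keyword_scores[k] for k in present), default=0.0)
        total = sum(lowered.count(k) for k in present)
        return (best, len(present), total)
    return sorted(texts, key=sort_key, reverse=True)

# Hypothetical association degrees; the query word itself scores 1.0.
scores = {"train ticket": 1.0, "railway": 0.8}
ranked = rank_texts(
    ["railway map", "train ticket for the railway", "railway railway timetable"],
    scores,
)
```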
In some possible embodiments, the text search method implemented by the present application further includes a step of establishing correspondence between a predetermined number of words and associated words. This step can be performed offline in the terminal. The corresponding relation between the predetermined number of words and the associated words can be established through the created matrix table, and the method comprises the following steps:
acquiring an initial matrix table constructed for a specific vocabulary; each specific vocabulary and associated vocabulary thereof in the initial matrix table have a pairing mapping relation, and the associated vocabulary corresponding to the specific vocabulary lacking the associated vocabulary is empty;
counting the number of specific vocabularies with associated vocabularies in the matrix table;
under the condition that the number is less than a preset number value, for a specific vocabulary lacking related vocabularies in the matrix table, obtaining the related vocabularies corresponding to the specific vocabulary by using a word embedding model, and writing the obtained related vocabularies into the matrix table until the number of the specific vocabulary with related vocabularies reaches the preset number;
the step of obtaining the associated vocabulary related to the semantics and/or grammar of the vocabulary to be searched comprises the following step:
and acquiring the related vocabulary from the matrix table according to the vocabulary to be searched.
In a specific implementation, the matrix table is created as follows. First, a predetermined number of specific vocabularies are acquired, together with an initial matrix table constructed for them; each specific vocabulary and its associated vocabulary in the initial matrix table have a pairing mapping relation, and for a specific vocabulary lacking an associated vocabulary, the associated-vocabulary cell is initialized to empty. The specific vocabularies in the matrix table are then traversed: the specific vocabularies that already have associated vocabularies are counted, and for each specific vocabulary lacking an associated vocabulary, the associated vocabulary is obtained by using a word embedding model and written into the corresponding cell, and the count is incremented. When the count reaches the preset number value, every specific vocabulary has a corresponding associated vocabulary, and the creation of the matrix table is complete.
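The table-filling loop can be sketched as follows, assuming the matrix table is held as a dictionary from a specific vocabulary to its list of associated words. The callable `find_associated` is a stand-in for the word embedding model, and the stub data for "government" is invented for illustration.

```python
from typing import Callable, Dict, List

def fill_matrix_table(table: Dict[str, List[str]],
                      find_associated: Callable[[str], List[str]],
                      target_count: int) -> Dict[str, List[str]]:
    """Fill empty entries until target_count words have associated words."""
    # Count the specific words that already have associated words.
    filled = sum(1 for words in table.values() if words)
    for word, words in table.items():
        if filled >= target_count:
            break  # the preset number has been reached
        if not words:
            found = find_associated(word)  # stand-in for the embedding model
            if found:
                table[word] = found
                filled += 1
    return table

table = {"fox": ["fast", "brown", "jump"], "government": [], "queen": []}
stub = {"government": ["state"], "queen": ["princess"]}  # hypothetical model output
fill_matrix_table(table, lambda w: stub.get(w, []), target_count=3)
```

Because the table is built once and stored, a query at search time only needs a lookup in the table, which is consistent with running the creation step offline.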
The word embedding model used here, also called the Word2Vec model, is a neural network model. In NLP (Natural Language Processing), the finest granularity is the word: words form sentences, and sentences form paragraphs, chapters, and documents. Words are symbols (in Chinese, English, Latin, and so on), so they need to be converted into numerical form; that is, each word is embedded into a mathematical space, which is called word embedding. Word2Vec is one such word embedding representation: it converts a word into a corresponding vector so that the data can be read by a machine.
As shown in fig. 8, the word embedding model in this embodiment includes a logical classifier (i.e., a straight line with an arrow in the figure) and a word vector converter (i.e., a rectangular box in the figure), where words in each text are converted into word vectors by the word vector converter, and the words are also converted into word vectors after inputting a specific word into the word embedding model, and then the logical classifier is used to adjust the weight of each contextual word in the window.
To obtain associated vocabularies by using the word embedding model, a specific vocabulary is converted into an array and input into the word embedding model; the model outputs a plurality of word vectors, and the words corresponding to these word vectors are the associated vocabularies of the specific vocabulary. The predetermined number of specific vocabularies may be the words in an Oxford dictionary.
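As an illustration of how word vectors can yield associated words, the sketch below ranks words by cosine similarity to the vector of the specific vocabulary. Nearest-neighbor lookup by cosine similarity is a common technique with word embeddings and is an assumption here, not necessarily the procedure used in the application; the 2-dimensional vectors are invented, whereas real embeddings are learned from a corpus.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def associated_words(word: str, vectors: Dict[str, Sequence[float]],
                     top_n: int = 2) -> List[str]:
    """Return the top_n words whose vectors are most similar to that of `word`."""
    target = vectors[word]
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(target, vectors[w]), reverse=True)
    return others[:top_n]

# Toy vectors, invented for illustration only.
vectors = {"queen": [0.9, 0.1], "princess": [0.8, 0.2], "tractor": [0.1, 0.9]}
nearest = associated_words("queen", vectors, top_n=1)
```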
In this embodiment, the number of the specific vocabulary with the associated vocabulary in the matrix table is counted first, and when the number is smaller than a predetermined number, for the specific vocabulary lacking the associated vocabulary in the matrix table, the associated vocabulary corresponding to the specific vocabulary is obtained by using a word embedding model, and the obtained associated vocabulary is written into the matrix table.
The following description further describes the creation of a particular vocabulary matrix table by way of a specific example.
The specific example creates a matrix table for the 273,000 specific vocabularies in the Oxford dictionary. Table 2 is the initial matrix obtained; as Table 2 shows, some specific vocabularies have associated vocabularies, while the associated vocabularies of others are empty.
TABLE 2
Vocabulary | Context words
Fox | Fast brown jump
.. | ..
Government | <empty>
.. | ..
Queen | Queen princess queen principal
Then the initial matrix is traversed: the specific vocabularies that have associated vocabularies are counted, and for each specific vocabulary lacking an associated vocabulary, the associated vocabulary is obtained by using the word embedding model, written into the cell, and counted, until the count reaches 273,000, as shown in Table 3. This completes the creation of the whole matrix.
TABLE 3
Serial number | Vocabulary | Context words
1 | Aircraft with a flight control device | Bombing sound of airplane flight ticket
.. | .. | ..
.. | .. | ..
.. | .. | ..
100 | Fox | Fast brown skip
.. | .. | ..
.. | .. | ..
271000 | Queen of women | Queen seat princess queen principal
273000 | .. | ..
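By way of a non-limiting illustration, the traversal and filling of the matrix table described above may be sketched in Python as follows. The names `fill_matrix_table`, `embed_lookup` and `target_count` are assumptions made for this sketch and do not appear in the embodiment; `embed_lookup` stands in for whatever word embedding model supplies the associated vocabulary.

```python
from typing import Callable, Dict, List


def fill_matrix_table(
    matrix: Dict[str, List[str]],
    embed_lookup: Callable[[str], List[str]],
    target_count: int,
) -> int:
    """Fill empty rows until target_count rows have associated vocabulary.

    matrix maps each specific vocabulary to its associated vocabulary;
    an empty list marks a row whose associated vocabulary is missing.
    embed_lookup is a hypothetical stand-in for the word embedding model.
    Returns the number of rows that have associated vocabulary afterwards.
    """
    # Count the rows that already have associated vocabulary.
    filled = sum(1 for words in matrix.values() if words)
    for word, associated in matrix.items():
        if filled >= target_count:
            break
        if not associated:
            found = embed_lookup(word)
            if found:
                # Overwriting an existing key is safe while iterating.
                matrix[word] = found
                filled += 1
    return filled


table = {"fox": ["fast", "brown", "jump"], "government": [], "queen": ["princess"]}
count = fill_matrix_table(table, lambda w: ["state"], target_count=3)
```

In this sketch the loop stops as soon as the count reaches the target, which mirrors counting up to 273000 in the example.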
In some possible embodiments, obtaining the associated vocabulary of each specific vocabulary with the word embedding model involves a training process that includes the following steps and can be performed offline:
acquiring a corpus;
searching the position where the specific vocabulary appears from each document in the corpus, and acquiring context words in a preset window of the specific vocabulary at each position;
and determining the weight of each context word, and outputting the context words with a weight larger than the preset weight as associated vocabulary related to the semantics/grammar of the specific vocabulary.
In the embodiment, the context words with the weight greater than the preset weight are output as the associated words related to the specific words in the semantic/grammar, so that the semantic/grammar association degree of the output associated words and the specific words can be improved, and the search noise can be reduced.
When the corpus is updated, the training process can be executed again to change the weight of the context words. Corpus refers to a large-scale electronic text library that is sampled and processed.
The weight of each context word may be determined according to the frequency with which that context word occurs. Specifically, when a specific vocabulary appears at multiple positions in a document, the frequency of each context word within the window of the specific vocabulary at each position can be obtained; the higher the frequency, the larger the weight is set, and the lower the frequency, the smaller the weight is set.
In addition, context words that have no classification ability with respect to the specific vocabulary, such as prepositions and conjunctions, may be identified and their weight reduced, while the weight of context words that do have classification ability is increased.
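As a minimal sketch of the weighting described above, the following Python code counts the context words around every occurrence of a specific vocabulary, uses relative frequency as the weight, scales down words without classification ability, and keeps the words above a preset weight. The function names, the `FUNCTION_WORDS` set, the `penalty` factor and the use of relative frequency are all assumptions made for illustration; the embodiment does not fix a particular formula.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical list of words with no classification ability.
FUNCTION_WORDS = {"the", "a", "of", "and", "in", "over"}


def context_weights(
    documents: List[List[str]], target: str, n: int = 4, penalty: float = 0.1
) -> Dict[str, float]:
    """Weight each context word by its relative frequency in the window."""
    half = n // 2
    counts: Counter = Counter()
    for tokens in documents:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            # n/2 words before and n/2 words after the target position.
            window = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
            counts.update(window)
    total = sum(counts.values())
    weights: Dict[str, float] = {}
    for word, c in counts.items():
        weight = c / total
        if word in FUNCTION_WORDS:
            weight *= penalty  # reduce the weight of function words
        weights[word] = weight
    return weights


def associated_words(weights: Dict[str, float], preset_weight: float) -> List[str]:
    """Keep the context words whose weight exceeds the preset weight."""
    return sorted(w for w, v in weights.items() if v > preset_weight)


docs = [
    ["the", "quick", "brown", "fox", "jumps", "over"],
    ["a", "brown", "fox", "jumps"],
]
result = associated_words(context_weights(docs, "fox"), preset_weight=0.2)
```

With these two toy documents, "brown" and "jumps" each occur twice near "fox" and are the only words above the 0.2 threshold.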
The window referred to in the embodiments of the present application may be defined by byte length; for example, with the position of the specific vocabulary as the center, all words within 10 bytes are regarded as context words of the specific vocabulary. The window may also be defined by the number of words closest to the specific vocabulary; for example, with the position of the specific vocabulary as the center, the n/2 words before it and the n/2 words after it are regarded as its context words, where n is set by the user and may be, for example, 10.
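The two window definitions can be contrasted with a short Python sketch. The function names are hypothetical, the byte window is measured on the UTF-8 encoding, only the first occurrence of the specific vocabulary is examined, and words cut in half by the window edge are dropped; these are all choices made for this sketch rather than features stated in the embodiment.

```python
from typing import List


def word_window(tokens: List[str], i: int, n: int = 10) -> List[str]:
    """Return the n/2 words before and the n/2 words after position i."""
    half = n // 2
    return tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]


def byte_window(text: str, target: str, radius: int = 10) -> List[str]:
    """Return the whole words lying within radius bytes of the target."""
    data = text.encode("utf-8")
    needle = target.encode("utf-8")
    pos = data.find(needle)
    if pos < 0:
        return []
    start = max(0, pos - radius)
    end = min(len(data), pos + len(needle) + radius)
    left = data[start:pos].split()
    right = data[pos + len(needle):end].split()
    # Drop a word that the left edge of the window cuts in half.
    if (left and start > 0
            and not data[start - 1:start].isspace()
            and not data[start:start + 1].isspace()):
        left = left[1:]
    # Drop a word that the right edge of the window cuts in half.
    if (right and end < len(data)
            and not data[end:end + 1].isspace()
            and not data[end - 1:end].isspace()):
        right = right[:-1]
    return [w.decode("utf-8") for w in left + right]


sentence = "the quick brown fox jumps over the lazy dog"
by_words = word_window(sentence.split(), 3, n=4)
by_bytes = byte_window(sentence, "fox", radius=10)
```

In this example the word-count window of size 4 yields four context words, whereas the 10-byte window yields only the two words that fit entirely inside it.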
After the corpus is obtained, the documents in the corpus are preprocessed to standardize their text. Specifically: the text is read in; punctuation marks are removed from each sentence and letter case is unified; each sentence is divided into words so that the text is converted into a word sequence; a dictionary is then established, and each word in the dictionary is mapped to a unique index so that the text is converted from a word sequence into an index sequence. The index may be defined as an ID number, with each word having a unique ID number.
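The preprocessing steps above may be sketched in Python as follows. The function names are illustrative only, and the sketch assumes whitespace-separated text with ASCII punctuation; text in languages without word spacing would need a separate word segmentation step.

```python
import string
from typing import Dict, Iterable, List


def preprocess(text: str) -> List[str]:
    """Remove punctuation, unify case and split the text into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).lower().split()


def build_dictionary(token_sequences: Iterable[List[str]]) -> Dict[str, int]:
    """Map each distinct word to a unique ID in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in token_sequences:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_index_sequence(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Convert a word sequence into the corresponding index sequence."""
    return [vocab[token] for token in tokens]


first = preprocess("The quick, brown Fox jumps!")
second = preprocess("The fox. The FOX!")
vocab = build_dictionary([first, second])
ids = to_index_sequence(second, vocab)
```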
After the documents in the corpus are preprocessed, irrelevant words such as conjunctions, prepositions and quantifiers can be deleted from the text; the specific vocabulary is then input to search for context words within the preset window, which improves both the relevance of the associated vocabulary obtained and the training efficiency. For the same preset window size, this embodiment can obtain more context words than when irrelevant words are not deleted.
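The effect of deleting irrelevant words before applying the window can be illustrated with the following Python sketch. The `STOP_WORDS` set and the function names are assumptions made for this example, and the sketch assumes that the specific vocabulary occurs in the text and is not itself an irrelevant word.

```python
from typing import List, Set

# Hypothetical set of irrelevant words (conjunctions, prepositions, etc.).
STOP_WORDS: Set[str] = {"the", "of", "and"}


def window(tokens: List[str], i: int, n: int) -> List[str]:
    """Return the n/2 words before and the n/2 words after position i."""
    half = n // 2
    return tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]


def content_context(
    tokens: List[str], target: str, n: int, delete_first: bool
) -> List[str]:
    """Return the non-irrelevant context words found in the window.

    With delete_first=True the irrelevant words are deleted before the
    window is applied, so they no longer occupy window positions.
    """
    if delete_first:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    i = tokens.index(target)  # assumes the target occurs in the text
    return [t for t in window(tokens, i, n) if t not in STOP_WORDS]


words = "the queen of the country and the princess".split()
without_deletion = content_context(words, "queen", n=4, delete_first=False)
with_deletion = content_context(words, "queen", n=4, delete_first=True)
```

For the same window size of 4, the window holds no usable context word when irrelevant words are kept, and two when they are deleted first.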
In one embodiment, the window size may be obtained as follows:
acquiring documents in at least two terminals;
searching the position of the specific vocabulary from each document, and acquiring the context words in different test windows of the specific vocabulary at each position aiming at different test windows;
determining the weight of each context word of each specific vocabulary in different test windows, and outputting the context word with the weight larger than the preset weight as a related vocabulary related to the semantics/grammar of the specific vocabulary;
and comparing the number of the associated vocabularies of each specific vocabulary in different test windows, the association degree of each associated vocabulary and the specific vocabulary in different test windows, and taking the test window with the association degree higher than a preset threshold value and/or the number of the associated vocabularies higher than the preset number as the preset window.
The inventor selected 5 terminals belonging to different users and tested 500 documents in total, and found that a preset window of 5 to 9 bytes in length balances the training speed, the number of associated vocabularies, and the degree of association between the specific vocabulary and its associated vocabulary.
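The comparison of test windows may be sketched in Python as follows. The embodiment does not state how the association degree is measured, so `relevance` is a hypothetical scoring callable, `associate` is a hypothetical callable returning the associated vocabulary obtained with a given test window, and the sketch requires both conditions to hold although the embodiment allows "and/or".

```python
from typing import Callable, List, Sequence


def choose_windows(
    test_windows: Sequence[int],
    associate: Callable[[int], List[str]],
    relevance: Callable[[str], float],
    min_relevance: float,
    min_count: int,
) -> List[int]:
    """Return the test windows that satisfy both thresholds."""
    qualifying: List[int] = []
    for size in test_windows:
        words = associate(size)
        if not words:
            continue
        # Average association degree of the words found with this window.
        average = sum(relevance(w) for w in words) / len(words)
        if average > min_relevance and len(words) > min_count:
            qualifying.append(size)
    return qualifying


# Toy data: a small window finds few words, a large one admits noise.
found = {
    3: ["brown"],
    7: ["brown", "quick", "jumps"],
    15: ["brown", "quick", "jumps", "the", "dog"],
}
score = {"brown": 0.9, "quick": 0.8, "jumps": 0.7, "the": 0.1, "dog": 0.2}
chosen = choose_windows(
    [3, 7, 15], found.__getitem__, score.__getitem__,
    min_relevance=0.6, min_count=2,
)
```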
In some possible embodiments, in step 206, the step of searching each text in the text library by using the keywords includes: searching each text in the text library by using the associated vocabulary of the vocabulary to be searched when it is detected that no text is found by using the vocabulary to be searched, that the number of texts found is less than a preset number, or that the texts found by using the vocabulary to be searched do not belong to the target text.
If the text found by using the vocabulary to be searched is the target text, that is, the text the user needs, or if a sufficient number of texts is found by using the vocabulary to be searched, the search using the associated vocabulary can be omitted; otherwise, the steps of this embodiment are executed. In this way, the output of noise texts can be reduced and the amount of searching can be reduced.
In some possible embodiments, the at least one keyword in step 206 refers to any one or more of the vocabulary to be searched and all associated vocabularies of the vocabulary to be searched. For example, if the vocabulary to be searched is "train" and its associated vocabulary includes railway, motor train and high-speed rail, then train, railway, motor train and high-speed rail are all keywords of "train", and any text containing one or more of these words is obtained.
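A minimal Python sketch of the two behaviours above, matching on at least one keyword and falling back to the associated vocabulary only when the direct search is insufficient, is given below. The function names, the dictionary-based text library and the `min_results` parameter are assumptions made for illustration, and plain substring matching stands in for whatever matching the terminal actually performs.

```python
from typing import Dict, List


def search(texts: Dict[str, str], keywords: List[str]) -> List[str]:
    """Return the names of the texts containing at least one keyword."""
    return [
        name for name, body in texts.items()
        if any(keyword in body for keyword in keywords)
    ]


def search_with_fallback(
    texts: Dict[str, str],
    word: str,
    matrix: Dict[str, List[str]],
    min_results: int = 1,
) -> List[str]:
    """Search with the word alone; widen to its keywords if too few hits."""
    hits = search(texts, [word])
    if len(hits) >= min_results:
        return hits
    # Keywords are the word to be searched plus its associated vocabulary.
    return search(texts, [word] + matrix.get(word, []))


library = {
    "a.txt": "train timetable",
    "b.txt": "railway map",
    "c.txt": "recipes",
}
associations = {"train": ["railway", "motor train", "high-speed rail"]}
direct = search_with_fallback(library, "train", associations, min_results=1)
widened = search_with_fallback(library, "train", associations, min_results=2)
```

When one result is enough, only the text containing "train" is returned; when two are required, the search widens and the text containing "railway" is returned as well.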
It should be understood that, although the steps in the flowchart of Fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in Fig. 2 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
It should be noted that, step numbers such as 202, 204, etc. are used herein for the purpose of more clearly and briefly describing the corresponding content, and do not constitute a substantial limitation on the sequence, and those skilled in the art may perform step number 204 and then step number 202 in the specific implementation, which should be within the protection scope of the present application.
As shown in fig. 9, it is a block diagram of a structure of a text search apparatus according to an embodiment. The text search apparatus 900 includes:
a vocabulary to be searched acquisition module 910, configured to acquire a vocabulary to be searched;
an associated vocabulary acquiring module 920, configured to acquire an associated vocabulary related to the vocabulary to be searched for by semantics and/or syntax, and use the vocabulary to be searched and the associated vocabulary as keywords of the vocabulary to be searched;
the searching module 930 is configured to search in each text in the text library by using the keyword of the vocabulary to be searched, and obtain a text including at least one keyword.
The division of each module in the text search apparatus is only for illustration, and in other embodiments, the text search apparatus may be divided into different modules as needed to complete all or part of the functions of the text search apparatus.
For the specific limitations of the text search apparatus, reference may be made to the limitations of the text search method above, which are not repeated here. The modules in the text search apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
Each module in the text search apparatus provided in the embodiments of the present application may be implemented in the form of a computer program. The computer program may run on a terminal or a server. The program modules constituted by the computer program may be stored in the memory of the terminal or the server. When executed by a processor, the computer program performs the steps of the method described in the embodiments of the present application.
The embodiments of the present application also provide a computer-readable storage medium: one or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the steps of the method described in any of the embodiments.
The embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method described in any of the above embodiments.
Any reference to memory, storage, database, or other medium used herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), SyncLink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A text search method, comprising the steps of:
acquiring a vocabulary to be searched;
obtaining related words related to semantics and/or grammar of the words to be searched, and taking the words to be searched and the related words as keywords;
and searching each text in a text library by using the keywords, and acquiring the text containing at least one keyword.
2. The method according to claim 1, wherein each text comprises a file name of each file in the terminal, and the step of searching each text in the terminal by using the keywords of the vocabulary to be searched comprises: acquiring each file in a terminal, acquiring the file name of each file in the terminal, searching in each file name by using the keywords of the vocabulary to be searched, and acquiring the file name containing at least one keyword and a corresponding file;
or, the steps of obtaining each text in the terminal and searching in each text by using the keywords of the vocabulary to be searched include: and acquiring each file of the terminal, acquiring texts in each file in the terminal, searching the texts in each file by using the keywords of the vocabulary to be searched, and acquiring the texts containing at least one keyword and the corresponding files.
3. The method according to claim 1 or 2, wherein said step of obtaining text containing at least one keyword is followed by:
and displaying the text containing at least one keyword on a display interface, and marking the appearing keyword.
4. The method according to claim 1 or 2, further comprising the step of creating a matrix table by:
acquiring an initial matrix table constructed for a specific vocabulary; each specific vocabulary and associated vocabulary thereof in the initial matrix table have a pairing mapping relation, and the associated vocabulary corresponding to the specific vocabulary lacking the associated vocabulary is empty;
counting the number of specific vocabularies with associated vocabularies in the matrix table;
under the condition that the number is less than a preset number value, for a specific vocabulary lacking related vocabularies in the matrix table, obtaining the related vocabularies corresponding to the specific vocabulary by using a word embedding model, and writing the obtained related vocabularies into the matrix table until the number of the specific vocabulary with related vocabularies reaches the preset number;
the step of obtaining the associated vocabulary related to the semantic and/or grammar of the vocabulary to be searched comprises the following steps:
and acquiring the associated vocabulary from the matrix table according to the vocabulary to be searched.
5. The method of claim 4, wherein the step of creating a matrix table is performed in an off-line manner.
6. The method of claim 4, wherein the step of using the word embedding model to obtain the associated vocabulary of each specific vocabulary comprises:
obtaining a corpus;
searching the position where each specific vocabulary appears from each document in the corpus, and acquiring context words in a preset window of the specific vocabulary at each position;
determining the weight of each context word, and outputting the context words with the weight larger than the preset weight as associated words related to the semantics/grammar of the specific words;
wherein the weight of each context word is determined according to the frequency of occurrence of each context word.
7. The method of claim 6, further comprising:
acquiring documents in at least two terminals;
searching the position of the specific vocabulary from each document, and acquiring the context words in different test windows of the specific vocabulary at each position aiming at different test windows;
determining the weight of each context word of each specific vocabulary in different test windows, and outputting the context word with the weight greater than the preset weight as a related vocabulary related to the semantics/grammar of the specific vocabulary;
and comparing the number of the associated vocabularies of each specific vocabulary in different test windows, and the association degree of each associated vocabulary and the specific vocabulary in different test windows, and taking the test window with the association degree higher than a preset threshold value and/or the number of the associated vocabularies higher than the preset number as the preset window.
8. The method according to claim 1 or 2, wherein the step of searching each text in the text library by using the keyword comprises:
searching each text in the text library by using the associated vocabulary when it is detected that no text is found by using the vocabulary to be searched, that the number of texts found is smaller than a preset number, or that the texts found by using the vocabulary to be searched do not belong to the target text.
9. A terminal, characterized in that the terminal comprises a memory, a processor, wherein the memory has stored thereon a computer program which, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 8.
10. A readable storage medium, characterized in that a computer program is stored thereon which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011544265.XA 2020-12-23 2020-12-23 Text searching method, terminal and readable storage medium Pending CN114661852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011544265.XA CN114661852A (en) 2020-12-23 2020-12-23 Text searching method, terminal and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011544265.XA CN114661852A (en) 2020-12-23 2020-12-23 Text searching method, terminal and readable storage medium

Publications (1)

Publication Number Publication Date
CN114661852A true CN114661852A (en) 2022-06-24

Family

ID=82024406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011544265.XA Pending CN114661852A (en) 2020-12-23 2020-12-23 Text searching method, terminal and readable storage medium

Country Status (1)

Country Link
CN (1) CN114661852A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891839A (en) * 2024-03-14 2024-04-16 福建省政务门户网站运营管理有限公司 Intelligent retrieval method and system
CN117891839B (en) * 2024-03-14 2024-06-07 福建省政务门户网站运营管理有限公司 Intelligent retrieval method and system

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
US11675977B2 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
CN108287858B (en) Semantic extraction method and device for natural language
CN108717406B (en) Text emotion analysis method and device and storage medium
US9223779B2 (en) Text segmentation with multiple granularity levels
CN108509482B (en) Question classification method and device, computer equipment and storage medium
WO2021189951A1 (en) Text search method and apparatus, and computer device and storage medium
KR20110083623A (en) Machine learning for transliteration
US20090300003A1 (en) Apparatus and method for supporting keyword input
CN113076431A (en) Question and answer method and device for machine reading understanding, computer equipment and storage medium
CN113688954A (en) Method, system, equipment and storage medium for calculating text similarity
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN113722438A (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN111357015B (en) Text conversion method, apparatus, computer device, and computer-readable storage medium
CN113553439A (en) Method and system for knowledge graph mining
CN111160007B (en) Search method and device based on BERT language model, computer equipment and storage medium
CN116881425A (en) Universal document question-answering implementation method, system, device and storage medium
CN111444712B (en) Keyword extraction method, terminal and computer readable storage medium
CN112527954A (en) Unstructured data full-text search method and system and computer equipment
CN109684357B (en) Information processing method and device, storage medium and terminal
CN108920452B (en) Information processing method and device
CN104572628B (en) A kind of science based on syntactic feature defines automatic extraction system and method
CN114661852A (en) Text searching method, terminal and readable storage medium
CN114842982A (en) Knowledge expression method, device and system for medical information system
CN112800771B (en) Article identification method, apparatus, computer readable storage medium and computer device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination