CN114003685A - Word segmentation position index construction method and device, and document retrieval method and device - Google Patents

Info

Publication number
CN114003685A
Authority
CN
China
Prior art keywords
word segmentation
participle
word
target document
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210000597.4A
Other languages
Chinese (zh)
Other versions
CN114003685B (en)
Inventor
王峻岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ourchem Information Consulting Co ltd
Original Assignee
Ourchem Information Consulting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ourchem Information Consulting Co ltd filed Critical Ourchem Information Consulting Co ltd
Priority to CN202210000597.4A priority Critical patent/CN114003685B/en
Publication of CN114003685A publication Critical patent/CN114003685A/en
Application granted granted Critical
Publication of CN114003685B publication Critical patent/CN114003685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a method and a device for constructing a word segmentation position index, and to a method and a device for document retrieval. The construction method comprises: obtaining a target document for which an index is to be built; performing word segmentation processing on specific content of the target document to obtain a word segmentation set of the target document; and building a corresponding word segmentation position index for the target document according to the word segmentation set. The word segmentation position index records an index value for each participle in the word segmentation set, and the index value of each participle is equal to the arrangement sequence value, within the specific content of the target document, of a specific character of that participle. According to the embodiments of the application, a user can search with any phrase and the corresponding content can be retrieved accurately.

Description

Word segmentation position index construction method and device, and document retrieval method and device
Technical Field
The present application relates to the field of data retrieval, and in particular to a method and an apparatus for constructing a word segmentation position index, a method and an apparatus for document retrieval, a computer device, and a storage medium.
Background
In the big data era, with the rapid rise and popularization of Internet technology, the volume of data generated in different fields has reached an unprecedented level. Meanwhile, the ways in which data is generated, stored and processed have changed fundamentally, and people's work and life can largely be represented digitally, so adopting an effective data retrieval method is increasingly important.
Among the various full-text retrieval systems, Elasticsearch (hereinafter abbreviated as ES, a Lucene-based search server that provides a distributed, multi-user full-text search engine) is widely used because it is convenient, easy to use, fast and efficient. Currently, when ES segments a sentence, the segmentation is usually performed according to general semantic understanding. However, patent documents often contain newly coined technical phrases. When patent documents are segmented with the current method, the segmentation lexicon may not contain these phrases, so a user who searches patent documents with such a phrase may not retrieve the corresponding results.
For example, take the sentence "the present invention discloses an environment detection system and a detection device, the device includes a detection cartridge and a detection device". Ideally the segmentation would keep each technical phrase, such as "detection cartridge", as one unit. If the phrase "detection cartridge" is instead split into several words and retrieval is performed on those words, the sentence can still be found, but the cost of retrieval (time, resources, and so on) is high and the accuracy is low: the sentence the user wants is not hit precisely, and the results contain a large amount of irrelevant content. For example, if "detection cartridge" is split into "detection" and "cartridge" for searching, other text that merely contains both "detection" and "cartridge" is also retrieved, such as "the invention discloses a device and a detection method, the device includes a cartridge" or "the invention discloses a cartridge detection method".
Disclosure of Invention
In view of the above disadvantages, the present application provides a method and an apparatus for constructing a word segmentation position index, a method and an apparatus for document retrieval, a computer device, and a storage medium.
The present application provides a method for constructing a word segmentation position index according to a first aspect, and in one embodiment, the method includes:
acquiring a target document of an index to be constructed;
performing word segmentation processing on specific content of a target document to obtain a word segmentation set of the target document;
constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document; the word segmentation position index of the target document is used for recording an index value of each word segmentation in the word segmentation set of the target document, wherein the index value of each word segmentation is equal to the arrangement sequence value of a specific word in the word segmentation in specific content of the target document.
In one embodiment, the step of constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document includes:
and allocating a corresponding index value to each participle in the participle set of the target document, and constructing a corresponding participle position index for the target document according to the index value of each participle.
In one embodiment, the participle set of the target document comprises a plurality of participles with the word number of 1 and a plurality of participles with the word number exceeding 1; a plurality of word segmentation with the word number of 1 is each word in the specific content of the target document; assigning a corresponding index value to each participle in the participle set of the target document, including:
when a corresponding index value is allocated to each participle with the word number of 1, taking the arrangement sequence value of each participle in the specific content of the target document as the corresponding index value;
and when the corresponding index value is allocated to each participle with the word number exceeding 1, taking the arrangement sequence value of the specific word in each participle in the specific content of the target document as the corresponding index value.
In one embodiment, the specific word is a first word or a last word.
The present application provides, according to a second aspect, a document retrieval method, which, in one embodiment, includes:
performing word segmentation processing on the obtained retrieval text to obtain a word segmentation set;
determining an index value of each participle in a participle set, wherein the index value of each participle in the participle set is equal to the arrangement sequence value of a specific character in the participle in a retrieval text;
determining the position relation of the participle set according to the index value of each participle, wherein the position relation of the participle set represents the index value difference between a specific participle in the participle set and each other participle;
and inquiring the document index according to the word segmentation set to obtain an initial result set, and screening a final result set from the initial result set according to the word segmentation set, the position relation of the word segmentation set and the word segmentation position index of each document in the initial result set.
In one embodiment, the participle set comprises n participles with the word number of 1 and m participles with the word number exceeding 1; determining an index value of each participle in the participle set, comprising:
taking the arrangement sequence value of each participle with the word number of 1 in the retrieval text as the corresponding index value;
and taking the arrangement sequence value of the specific character of each participle with the character number exceeding 1 in the retrieval text as a corresponding index value.
In one embodiment, the screening out a final result set from the initial result set according to the word segmentation set, the position relationship of the word segmentation set and the word segmentation position index of each document in the initial result set includes:
inquiring the word segmentation position index of each document according to the word segmentation set to obtain word segmentation position information of each document;
checking whether each document accords with the position relation of the word segmentation set according to the word segmentation position information of each document;
and screening the documents which accord with the position relation of the participle set from the initial result set to be used as a final result set.
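The three screening steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the names `matches_relation` and `screen`, the in-memory dictionaries, and the sample Chinese strings are hypothetical, positions are counted from 1, and the first word of each participle is taken as the specific word.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory form of a word segmentation position index:
# participle -> list of index values (1-based, first-word convention).
PositionIndex = Dict[str, List[int]]


def matches_relation(index: PositionIndex, anchor: str,
                     relation: List[Tuple[str, int]]) -> bool:
    """Check whether some occurrence of `anchor` satisfies every index value difference."""
    for base in index.get(anchor, []):
        if all(base + diff in index.get(text, []) for text, diff in relation):
            return True
    return False


def screen(initial_results: Dict[str, PositionIndex], anchor: str,
           relation: List[Tuple[str, int]]) -> List[str]:
    """Keep only the documents whose position index fits the position relation."""
    return [doc_id for doc_id, index in initial_results.items()
            if matches_relation(index, anchor, relation)]


# Assumed sample data: doc1 stands for "检测盒及检测设备", doc2 for "盒检测方法".
initial_results = {
    "doc1": {"检测": [1, 5], "盒": [3], "及": [4], "设备": [7]},
    "doc2": {"盒": [1], "检测": [2], "方法": [4]},
}
# Search text "检测盒": "检测" has index value 1 and "盒" has index value 3,
# so the difference relative to the specific participle "检测" is 2.
final_results = screen(initial_results, "检测", [("盒", 2)])
print(final_results)
```

Both sample documents contain the two participles, so both would appear in the initial result set; only the one in which "盒" follows "检测" at the expected distance survives the screening.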
In one embodiment, the specific word is a first word or a last word.
In one embodiment, the particular participle in the participle set is any one participle in the participle set.
In one embodiment, the word segmentation position index of each document in the initial result set is constructed by using the word segmentation position index construction method provided by any one of the above embodiments.
The present application provides a device for constructing a word segmentation position index according to a third aspect, and in one embodiment, the device includes:
the target document acquisition module is used for acquiring a target document of the index to be constructed;
the word segmentation module is used for carrying out word segmentation processing on the specific content of the target document to obtain a word segmentation set of the target document;
the index construction module is used for constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document; the word segmentation position index of the target document is used for recording an index value of each word segmentation in the word segmentation set of the target document, wherein the index value of each word segmentation is equal to the arrangement sequence value of a specific word in the word segmentation in specific content of the target document.
The present application provides, according to a fourth aspect, a document retrieval apparatus, which, in one embodiment, includes:
the word segmentation module is used for carrying out word segmentation processing on the obtained retrieval text to obtain a word segmentation set;
the index value determining module is used for determining the index value of each participle in the participle set, wherein the index value of each participle in the participle set is equal to the arrangement sequence value of a specific character in the participle in the search text;
the position relation determining module is used for determining the position relation of the participle set according to the index value of each participle, and the position relation of the participle set represents the index value difference between a specific participle in the participle set and each other participle;
and the retrieval module is used for querying the document index according to the word segmentation set to obtain an initial result set, and screening out a final result set from the initial result set according to the word segmentation set, the position relation of the word segmentation set and the word segmentation position index of each document in the initial result set.
According to a fifth aspect, the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above-described embodiments of the method when executing the computer program.
The present application provides according to a sixth aspect a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the embodiments of any of the methods described above.
In the embodiments of the application, a target document for which an index is to be constructed is obtained; word segmentation processing is performed on specific content of the target document to obtain a word segmentation set of the target document; and a corresponding word segmentation position index is constructed for the target document according to that set. The word segmentation position index records an index value for each participle in the word segmentation set, where the index value of a participle equals the arrangement sequence value, in the specific content of the target document, of a specific character of the participle, such as its first or last character. According to the embodiments of the application, a user can search with any phrase and the corresponding content can be retrieved accurately.
Drawings
FIG. 1 is a flow chart illustrating a method for constructing a word segmentation position index according to an embodiment;
FIG. 2 is a flowchart illustrating a document retrieval method according to an embodiment;
FIG. 3 is a block diagram of an apparatus for constructing a word segmentation position index according to an embodiment;
FIG. 4 is a block diagram showing the construction of a document retrieval apparatus according to an embodiment;
FIG. 5 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a method for constructing a word segmentation position index. In one embodiment, the word segmentation position index construction method comprises the steps as shown in FIG. 1. The word segmentation position index construction method is described below with reference to fig. 1.
S110: Acquire a target document for which an index is to be constructed.
The target document for which an index is to be constructed is a document for whose specific content a word segmentation position index needs to be constructed.
The document may be a patent document, but of course, the document may also be other documents, such as academic papers, official documents, and the like.
The specific content may be the entire content of the target document or a part of it. Taking a patent document as an example, to let users retrieve patent documents accurately, a word segmentation position index may be established for individual fields of the patent document, including but not limited to the specification, the claims, the patent name and the abstract.
S120: Perform word segmentation processing on the specific content of the target document to obtain a word segmentation set of the target document.
When performing the word segmentation process on the target document, the specific content of the target document, such as the content of the specification field, the content of the claim field, or the content of the abstract field of the patent document, is acquired to perform the word segmentation process.
Specifically, ES (Elasticsearch) may be used to perform the word segmentation processing, that is, the specific content of the target document is divided into a plurality of participles, which together form the word segmentation set of the target document. The word segmentation process of ES is prior art and is not described in detail in this embodiment.
The following explains a segmentation set obtained by the segmentation processing.
The participle set of the target document comprises a plurality of participles with the word number of 1 and a plurality of participles with the word number exceeding 1. The present embodiment classifies the participles in the participle set of the target document into two types, i.e., the participle with the word number of 1 and the participle with the word number of more than 1, with the dimension of whether the word number exceeds 1.
For convenience of introduction, the word segmentation with the number of 1 is referred to as single-word segmentation, and the word segmentation with the number of more than 1 is referred to as multi-word segmentation. Specifically, the number of the individual word segmentations included in the segmentation set of the target document may be equal to the number of words of the specific content of the target document, that is, each word in the specific content of the target document is divided into one segmentation.
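The shape of such a word segmentation set can be illustrated with a toy segmenter. This is a sketch under stated assumptions only: in the embodiment the segmentation is performed by ES, whereas the function `segment`, the small lexicon and the sample Chinese string below are hypothetical stand-ins chosen so that the result contains one single-word participle per word of the content plus some multi-word participles.

```python
from typing import List, Set, Tuple


def segment(content: str, lexicon: Set[str]) -> List[Tuple[int, str]]:
    """Return (0-based offset, participle) pairs for the given content.

    Every single word becomes a participle, and every lexicon phrase that
    starts at an offset becomes an additional multi-word participle.
    """
    participles = []
    for offset, char in enumerate(content):
        participles.append((offset, char))
        for phrase in sorted(lexicon):
            if len(phrase) > 1 and content.startswith(phrase, offset):
                participles.append((offset, phrase))
    return participles


content = "检测盒及检测设备"   # assumed sample content, 8 words long
lexicon = {"检测", "设备"}     # assumed lexicon without the phrase "检测盒"
participles = segment(content, lexicon)
single = [p for p in participles if len(p[1]) == 1]
multi = [p for p in participles if len(p[1]) > 1]
print(len(single), len(multi))
```

The number of single-word participles equals the number of words in the content, which matches the property described above.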
In one example, assume that the specific content of the target document is "the present invention discloses an environment detection system and a detection device, the device includes a detection cartridge and a detection device"; performing word segmentation on it yields the word segmentation set shown in Table one.
Table one:
[Table one: image not reproduced]
As can be seen from Table one, the word segmentation set contains 49 participles, of which 33 are single-word participles and 16 are multi-word participles.
S130: Construct a corresponding word segmentation position index for the target document according to the word segmentation set of the target document.
After the word segmentation set is obtained, corresponding index values are distributed to all the words in the word segmentation set, and a mapping relation, namely a word segmentation position index, is established for all the words and the corresponding index values, so that whether all the words in the search text are continuous or not can be determined through the word segmentation position index during searching.
The word segmentation position index of the target document records an index value for each participle in the word segmentation set of the target document, where the index value of each participle is equal to the arrangement sequence value of a specific word of the participle in the specific content of the target document. The specific word is the first word or the last word of the participle.
The index value of each multi-character participle in the participle set is set as the arrangement sequence value of the specific character in the participle in the specific content of the target document, and the operation can be realized by modifying the source code of the ES.
The following describes the change of the index value of each participle in the participle set before and after the modification of the ES source code.
Table two:
[Table two: image not reproduced]
Before the source code of ES is modified, the index values assigned by ES to the participles in Table one may be as shown in Table two. After the source code of ES is modified, the index values assigned by ES to the participles in Table one may be as shown in Table three.
Table three:
[Table three: image not reproduced]
In this embodiment, there is no need to adjust the existing word segmentation mode of ES; merely changing the index value of each participle in the word segmentation set solves the problem that the existing ES cannot retrieve, or cannot accurately retrieve, a specific document because the latest technical phrase is absent from the segmentation lexicon.
Specifically, after the source code of ES is modified, when ES assigns index values to the participles in the word segmentation set, the index value of each participle with more than one word is set to the arrangement order value, in the specific content of the target document, of the specific word of that participle. During retrieval, whether two participles are contiguous can then be judged from the difference between the index values of the participles split from the search text. As a result, when a user searches with an arbitrary phrase, whether or not the phrase is meaningful in practice, the documents containing that phrase can be retrieved accurately and retrieval accuracy improves. For example, the keyword "box and sense" accurately hits the document corresponding to Table three above.
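The idea that contiguity can be judged from index value differences can be sketched with single-word participles alone. This is a simplified illustration, not the modified ES source code: the helper names `char_positions` and `is_contiguous_hit` and the sample Chinese strings are assumptions, and positions are counted from 1.

```python
from collections import defaultdict
from typing import Dict, List


def char_positions(content: str) -> Dict[str, List[int]]:
    """Map every single word to its 1-based positions in the content."""
    positions = defaultdict(list)
    for offset, char in enumerate(content):
        positions[char].append(offset + 1)
    return dict(positions)


def is_contiguous_hit(content: str, phrase: str) -> bool:
    """True if the words of `phrase` occur consecutively in `content`.

    The check uses only index value differences: the k-th word of the phrase
    must appear k positions after an occurrence of the first word.
    """
    if not phrase:
        return False
    positions = char_positions(content)
    for base in positions.get(phrase[0], []):
        if all(base + step in positions.get(char, [])
               for step, char in enumerate(phrase)):
            return True
    return False


content = "检测盒及检测设备"                 # assumed sample content
print(is_contiguous_hit(content, "盒及检"))  # arbitrary fragment across phrase boundaries
print(is_contiguous_hit(content, "盒检测"))  # same words, but not adjacent in the content
```

The first query is not a meaningful phrase and would not be in any lexicon, yet it is matched because its words sit at consecutive positions; the second is rejected even though every word occurs somewhere in the content.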
In an embodiment, the step of constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document specifically includes: and allocating a corresponding index value to each participle in the participle set of the target document, and constructing a corresponding participle position index for the target document according to the index value of each participle. The constructed word segmentation position index is shown in the third table.
Further, in an embodiment, the step of assigning a corresponding index value to each participle in the participle set of the target document specifically includes: when a corresponding index value is allocated to each participle with the word number of 1, taking the arrangement sequence value of each participle in the specific content of the target document as the corresponding index value; and when the corresponding index value is allocated to each participle with the word number exceeding 1, taking the arrangement sequence value of the specific word in each participle in the specific content of the target document as the corresponding index value.
It is understood that the word set may include repeated word segments, for example, the word set shown in table one has 4 word segments "detected". Therefore, it is necessary to assign corresponding index values to the participles in the participle set in order.
The arrangement order value of a single-word participle in the specific content of the target document indicates its ordinal position, that is, which word of the specific content it is.
In one example, the specific content of the target document is again "the invention discloses an environment detection system and a detection device, the device comprises a detection cartridge and a detection device", and the position of each word in the specific content is shown in Table four. For example, the position corresponding to the word "this" is 1, which indicates that "this" is the 1st word of the specific content.
Table four:
[Table four: image not reproduced]
For a multi-word participle, the arrangement order value of its first or last word may be used as the index value. Taking the first word as the specific word, the index value of "invention" in Table one is the arrangement order value of the word "this" in the specific content, namely 1, and the index value of "detection" in Table one is the arrangement order value of the first word of "detection" in the specific content, namely 10.
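The assignment of index values described above can be sketched as follows. This is a minimal sketch, assuming the participles are available as (0-based offset, text) pairs; the function `build_position_index`, its `use_first_char` flag and the sample data are hypothetical and do not reproduce the modified ES source code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_position_index(participles: List[Tuple[int, str]],
                         use_first_char: bool = True) -> Dict[str, List[int]]:
    """Map each participle to its index values (1-based arrangement order values).

    For a single-word participle the first and last word coincide, so both
    conventions give its own position. For a multi-word participle the index
    value is the position of its first or last word, depending on the flag.
    """
    index = defaultdict(list)
    for offset, text in participles:
        first = offset + 1
        last = offset + len(text)
        index[text].append(first if use_first_char else last)
    return dict(index)


# Assumed participles for the sample content "检测盒及检测设备".
participles = [
    (0, "检"), (0, "检测"), (1, "测"), (2, "盒"), (3, "及"),
    (4, "检"), (4, "检测"), (5, "测"), (6, "设"), (6, "设备"), (7, "备"),
]
first_index = build_position_index(participles, use_first_char=True)
last_index = build_position_index(participles, use_first_char=False)
print(first_index["检测"], last_index["检测"])
```

Repeated participles simply accumulate several index values, processed in order of appearance, which is why the mapping stores a list per participle.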
The application also provides a document retrieval method. In one embodiment, the document retrieval method includes the steps shown in FIG. 2. The document retrieval method is explained below with reference to fig. 2.
S210: Perform word segmentation processing on the obtained search text to obtain a word segmentation set.
The search text is a single search term in the search formula.
When receiving a search request from a user, the ES may obtain a search keyword and a search field of the user, generate a search formula, and extract a search text in the search formula to execute the document search method provided in this embodiment.
For example, a user enters the keyword "detection cartridge" in the "patent name" column of the client's search page and then triggers a search instruction. In response, the client sends a search request to ES; the request may carry information such as the search field "patent name" and the search term "detection cartridge". After receiving the request, ES extracts this information to generate a search formula, such as "patent name: detection cartridge", and searches with it: the search term "detection cartridge" is extracted from the search formula as the search text, and word segmentation is performed on it to obtain a word segmentation set, which may include the two participles "detection" and "cartridge". If the keyword entered by the user is "detection cartridge AND detection device", the generated search formula may be "patent name: detection cartridge AND detection device". ES then extracts "detection cartridge" and "detection device" as search texts, retrieves a result set for each of them, and derives the result set finally returned to the user from the two result sets.
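How a search formula of this shape might be split into a search field and its search texts can be sketched as below. The textual format "field: term AND term" is taken from the example above, but the function `parse_search_formula` and its splitting rules are assumptions for illustration; they are not the query parser of ES.

```python
from typing import List, Tuple


def parse_search_formula(formula: str) -> Tuple[str, List[str]]:
    """Split a formula such as 'field: a AND b' into ('field', ['a', 'b'])."""
    field, _, expression = formula.partition(":")
    texts = [part.strip() for part in expression.split(" AND ")]
    return field.strip(), [text for text in texts if text]


field, search_texts = parse_search_formula(
    "patent name: detection cartridge AND detection device")
print(field, search_texts)
```

Each extracted search text would then go through the retrieval steps separately before the per-text result sets are combined.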
Further, the participle set comprises n participles with the word number of 1 and m participles with the word number exceeding 1, wherein n and m are natural numbers. For convenience of introduction, the word segmentation with the number of 1 is referred to as single-word segmentation, and the word segmentation with the number of more than 1 is referred to as multi-word segmentation.
S220: Determine the index value of each participle in the word segmentation set, where the index value of each participle is equal to the arrangement sequence value of a specific character of the participle in the search text.
The index value of each participle is the arrangement order value of a specific character of the participle in the search text, that is, the ordinal position of that character within the search text. The specific character is the first character or the last character of the participle.
For example, take the search text "environment detection system", whose word segmentation set is "environment", "detection" and "system", each two characters long in the original text, and let the specific character be the first character. The index values of the three participles are then 1, 3 and 5 respectively, that is, the arrangement order values of their first characters in "environment detection system".
S230: Determine the position relation of the word segmentation set according to the index value of each participle.
Wherein, the position relation of the participle set represents the index value difference between a specific participle and each other participle in the participle set.
Further, the specific participle may be any participle in the participle set, such as the first participle (referring to the participle with the smallest index value).
Illustratively, taking the search text "environment detection system" and "environment" as the specific participle, the index value difference between "environment" and "detection" is 3-1=2, and the index value difference between "environment" and "system" is 5-1=4.
S240: Query the document index according to the word segmentation set to obtain an initial result set, and screen a final result set from the initial result set according to the word segmentation set, the position relation of the word segmentation set and the word segmentation position index of each document in the initial result set.
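Steps S220 and S230 for this example can be sketched as follows. This is a minimal sketch under stated assumptions: the search text is assumed to be the six-character Chinese string "环境检测系统", the participles are given as (0-based offset, text) pairs, and the function names `index_values` and `position_relation` are hypothetical.

```python
from typing import List, Tuple


def index_values(participles: List[Tuple[int, str]],
                 use_first_char: bool = True) -> List[Tuple[str, int]]:
    """S220: index value = 1-based position of the first (or last) character."""
    return [(text, offset + 1 if use_first_char else offset + len(text))
            for offset, text in participles]


def position_relation(values: List[Tuple[str, int]],
                      anchor: int = 0) -> List[Tuple[str, int]]:
    """S230: index value difference between the specific participle and each other one."""
    anchor_value = values[anchor][1]
    return [(text, value - anchor_value)
            for i, (text, value) in enumerate(values) if i != anchor]


# Assumed segmentation of the search text "环境检测系统".
participles = [(0, "环境"), (2, "检测"), (4, "系统")]
values = index_values(participles)
relation = position_relation(values)
print(values)
print(relation)
```

With the first character as the specific character the index values are 1, 3 and 5, and the differences relative to the first participle are 2 and 4, in line with the arithmetic given above.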
The document index may be an inverted index, and the inverted index may be constructed by a currently common construction method. And then screening a final result set from the initial result set according to the position relation of the participle set and the participle position index of each document in the initial result set.
When a document corresponds to several word segmentation position indexes (for example, a patent document has separate word segmentation position indexes for fields such as the specification, the claims, the abstract and the invention name), the word segmentation position index to use is determined by the search field of the search text. For example, if the search field is "patent name" as in the example above, the word segmentation position index of the document is the one built for "patent name". Further, before the search text is segmented, the word segmentation position index of each document in the initial result set may be constructed with the word segmentation position index construction method of any of the above embodiments. The "specific word" in this document retrieval method must correspond to the "specific word" used by that construction method: if, when the index was built, the index value of each participle was set to the arrangement order value of its first word (or last word) in the specific content of the document, then in step S220 the index value of each participle must likewise equal the arrangement order value of its first word (or last word) in the search text.
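One way to keep the per-field indexes and the first-word or last-word convention together is sketched below. This is an illustrative design under stated assumptions, not part of the described embodiment: the class `FieldPositionIndex`, the function `select_field_index` and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FieldPositionIndex:
    """Word segmentation position index of one field, with its convention."""
    specific_char: str                 # "first" or "last"
    positions: Dict[str, List[int]]    # participle -> index values


def select_field_index(document_indexes: Dict[str, FieldPositionIndex],
                       search_field: str,
                       query_specific_char: str) -> FieldPositionIndex:
    """Pick the index for the search field and check that conventions agree."""
    field_index = document_indexes[search_field]
    if field_index.specific_char != query_specific_char:
        raise ValueError("index and search text use different specific words")
    return field_index


# Assumed per-field indexes of one document.
document_indexes = {
    "patent name": FieldPositionIndex("first", {"检测": [1], "盒": [3]}),
    "abstract": FieldPositionIndex("first", {"检测": [10, 17]}),
}
selected = select_field_index(document_indexes, "patent name", "first")
print(selected.positions)
```

Storing the convention next to the positions makes a mismatch between index construction and step S220 fail loudly instead of silently producing wrong index value differences.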
This embodiment performs word segmentation processing on the obtained retrieval text to obtain a participle set, and then determines an index value of each participle in the participle set, where the index value of each participle equals the arrangement sequence value, in the retrieval text, of a specific word in that participle. The position relation of the participle set is then determined according to the index value of each participle. Finally, a document index is queried according to the participle set to obtain an initial result set, and a final result set is screened from the initial result set according to the participle set, the position relation of the participle set, and the participle position index of each document in the initial result set. The user can therefore search with any phrase and accurately retrieve the corresponding content.
In an embodiment, the step of determining the index value of each participle in the participle set specifically includes: taking the arrangement sequence value, in the retrieval text, of each participle with a word number of 1 (that is, a single-character participle) as its index value; and taking the arrangement sequence value, in the retrieval text, of the specific word of each participle with a word number exceeding 1 as its index value. For example, if the search text is "cartridge and detection", the participle set is "cartridge", "and", and "detection", and the index values corresponding to these three participles are 1, 3, and 4, respectively. These values count characters of the original search text with the first word as the specific word: "cartridge" begins at position 1 and occupies two positions, "and" is at position 3, and "detection" begins at position 4.
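By way of illustration only, the assignment of index values to the participles of a search text may be sketched in Python as follows. The function name `token_index_values`, the flag `use_first_char`, and the placeholder text "ABCDE" (in which each letter stands for one character of the search text) are assumptions made for this sketch and do not appear in the embodiment. The sketch also assumes that the participles cover the search text in order without gaps or overlaps.

```python
def token_index_values(tokens, use_first_char=True):
    """Return (participle, index value) pairs for a segmented search text.

    Assumes `tokens` are the participles in reading order and that together
    they cover the search text without gaps or overlaps (an assumption of
    this sketch only).
    """
    values = []
    position = 1  # 1-based position of the current participle's first character
    for token in tokens:
        first = position
        last = position + len(token) - 1
        # For a single-character participle, first == last.
        values.append((token, first if use_first_char else last))
        position += len(token)
    return values


# Placeholder text "ABCDE": "AB" and "DE" stand for two-character participles,
# "C" for a single-character participle.
print(token_index_values(["AB", "C", "DE"]))
# [('AB', 1), ('C', 3), ('DE', 4)]
print(token_index_values(["AB", "C", "DE"], use_first_char=False))
# [('AB', 2), ('C', 3), ('DE', 5)]
```

With the first character as the specific word, the placeholder participles receive the index values 1, 3, and 4, which mirrors the "cartridge and detection" example above.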
Further, in an embodiment, the step of screening out the final result set from the initial result set according to the participle set, the position relation of the participle set, and the participle position index of each document in the initial result set specifically includes:
querying the word segmentation position index of each document according to the word segmentation set to obtain word segmentation position information of each document;
checking whether each document conforms to the position relation of the word segmentation set according to the word segmentation position information of each document;
and screening the documents that conform to the position relation of the participle set from the initial result set as the final result set.
The above-described embodiment is explained below by way of an example.
In one example, assume that the search field specified by the user is the abstract, the search text extracted by the ES is "detection device", the participle set is "detection" and "device", and the specific word is the first word, so that the index value difference is 3-1=2. The content of the abstract field of a certain document A is "the invention discloses an environment detection system and detection device, the device comprises a detection cartridge and a detection device", and the participle position index of document A is as shown in table three.
First, the participle position index of document A is queried according to the participle set, namely "detection" and "device", to obtain the index values of each participle of the participle set in the participle position index of document A: the index values of the participle "detection" are 10, 15, 25, and 30, and the index values of the participle "device" are 17 and 21.
Then, the first index value of the participle "detection" and the first index value of the participle "device" are taken, and the difference between the two is calculated. If the difference equals 2, the two participles are consecutive; document A then conforms to the position relation of the participle set and can be selected into the final result set. If the difference is greater than 2, the two participles are not consecutive, so the next index value of "detection" is taken and the comparison step is repeated. In this example, when the index value of "detection" is 15 and the index value of "device" is 17, the difference between the two index values is 2, which conforms to the position relation of the participle set.
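The comparison in this example may be sketched in Python as a walk over two sorted lists of index values. The function name `find_adjacent` and its parameters are hypothetical. The text above only states what happens when the difference is greater than the expected value; advancing to the next index value of the second participle when the difference is smaller is an assumption added here so that the sketch terminates for arbitrary inputs.

```python
def find_adjacent(first_positions, second_positions, expected_diff):
    """Find one pair of index values whose difference equals expected_diff.

    Both lists must be sorted in ascending order. Returns a tuple
    (first_value, second_value), or None if no pair conforms.
    """
    i = j = 0
    while i < len(first_positions) and j < len(second_positions):
        diff = second_positions[j] - first_positions[i]
        if diff == expected_diff:
            return first_positions[i], second_positions[j]
        if diff > expected_diff:
            # Not consecutive: take the next index value of the first participle.
            i += 1
        else:
            # Assumption of this sketch: the second participle lies too early,
            # so take its next index value.
            j += 1
    return None


detection = [10, 15, 25, 30]  # index values of "detection" in document A
device = [17, 21]             # index values of "device" in document A
print(find_adjacent(detection, device, 2))
# (15, 17)
```

The first comparison (17 - 10 = 7) exceeds 2, so the next index value of "detection" is taken; the second comparison (17 - 15 = 2) conforms, so document A would be selected.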
It should be noted that, unless explicitly stated otherwise herein, the steps included in the word segmentation position index construction method provided in any one of the above embodiments are not strictly limited in order of execution and may be executed in other orders. Moreover, at least some of the steps may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the application also provides a word segmentation position index construction device. In this embodiment, as shown in fig. 3, the word segmentation position index construction device includes the following modules:
a target document obtaining module 110, configured to obtain a target document of an index to be constructed;
a word segmentation module 120, configured to perform word segmentation processing on specific content of the target document to obtain a word segmentation set of the target document;
the index building module 130 is configured to build a corresponding word segmentation position index for the target document according to the word segmentation set of the target document; the word segmentation position index of the target document is used for recording an index value of each word segmentation in the word segmentation set of the target document, wherein the index value of each word segmentation is equal to the arrangement sequence value of a specific word in the word segmentation in specific content of the target document.
In one embodiment, the specific word is a first word or a last word.
In one embodiment, the index construction module is configured to execute the following steps to construct a corresponding word segmentation position index for a target document according to a word segmentation set of the target document:
allocating a corresponding index value to each participle in the participle set of the target document, and constructing a corresponding participle position index for the target document according to the index value of each participle.
In one embodiment, the participle set of the target document comprises a plurality of participles with a word number of 1 and a plurality of participles with a word number exceeding 1; the participles with a word number of 1 are the individual words (characters) of the specific content of the target document; correspondingly, when the index building module is configured to assign a corresponding index value to each participle in the participle set of the target document, the index building module is specifically configured to:
when a corresponding index value is allocated to each participle with the word number of 1, taking the arrangement sequence value of each participle in the specific content of the target document as the corresponding index value;
and when the corresponding index value is allocated to each participle with the word number exceeding 1, taking the arrangement sequence value of the specific word in each participle in the specific content of the target document as the corresponding index value.
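As a minimal sketch of the index building module described above, the following Python function records an index value for every single-character participle and for every occurrence of a multi-character participle. The embodiment does not prescribe a particular segmenter, so the `vocabulary` argument (a list of multi-character participles assumed to have been produced by some word segmentation step), the function name `build_position_index`, and the placeholder text are all assumptions of this sketch.

```python
from collections import defaultdict


def build_position_index(text, vocabulary, use_first_char=True):
    """Build a participle position index (participle -> list of index values).

    `vocabulary` is a hypothetical list of multi-character participles.
    """
    index = defaultdict(list)
    # Participles with a word number of 1: every character of the content,
    # indexed by its 1-based arrangement order value.
    for position, char in enumerate(text, start=1):
        index[char].append(position)
    # Participles with a word number exceeding 1: record the position of the
    # specific (first or last) character of each occurrence.
    for word in vocabulary:
        if len(word) < 2:
            continue
        start = text.find(word)
        while start != -1:
            first = start + 1
            last = start + len(word)
            index[word].append(first if use_first_char else last)
            start = text.find(word, start + 1)
    return dict(index)


print(build_position_index("ABCAB", ["AB"]))
# {'A': [1, 4], 'B': [2, 5], 'C': [3], 'AB': [1, 4]}
```

Whichever choice of specific word is used here (first or last) would also have to be used when index values are determined for the search text, so that the index value differences remain comparable.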
For specific limitations of the word segmentation position index construction device, reference may be made to the above limitations of the word segmentation position index construction method, which are not described herein again. All or part of each module in the word segmentation position index construction device can be realized by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
Based on the same inventive concept, the application also provides a document retrieval device. In the present embodiment, as shown in fig. 4, the document retrieval apparatus includes the following modules:
a word segmentation module 210, configured to perform word segmentation processing on the obtained search text to obtain a word segmentation set;
an index value determining module 220, configured to determine an index value of each participle in the participle set, where the index value of each participle in the participle set is equal to an arrangement order value of a specific word in the participle in the search text;
a position relation determining module 230, configured to determine a position relation of the participle set according to the index value of each participle, where the position relation of the participle set indicates an index value difference between a specific participle in the participle set and each other participle;
and the retrieval module 240 is configured to query the document indexes according to the word segmentation sets to obtain an initial result set, and screen out a final result set from the initial result set according to the word segmentation sets, the position relationships of the word segmentation sets, and the word segmentation position indexes of each document in the initial result set.
In one embodiment, the specific word is a first word or a last word.
In one embodiment, the participle set comprises n participles with the word number of 1 and m participles with the word number exceeding 1; accordingly, the index value determining module is specifically configured to:
taking the arrangement sequence value of each participle with a word number of 1 in the retrieval text as a corresponding index value;
and taking the arrangement sequence value of the specific character of each participle with the character number exceeding 1 in the retrieval text as a corresponding index value.
In one embodiment, the retrieval module is specifically configured to:
querying the word segmentation position index of each document according to the word segmentation set to obtain word segmentation position information of each document;
checking whether each document conforms to the position relation of the word segmentation set according to the word segmentation position information of each document;
and screening the documents that conform to the position relation of the participle set from the initial result set as the final result set.
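The cooperation of the position relation determining module and the retrieval module may be sketched in Python as follows. This is a simplified illustration under stated assumptions: the inverted index is modelled as a dictionary from participle to a set of document identifiers, the participle position index of each document as a dictionary from participle to a list of index values, and the first participle of the search text is taken as the specific participle. The names `position_relation` and `retrieve` and the sample documents "B" and "C" are hypothetical.

```python
def position_relation(query_values):
    """Index value difference between the first participle and each other one."""
    anchor_token, anchor_value = query_values[0]
    offsets = [(token, value - anchor_value) for token, value in query_values[1:]]
    return anchor_token, offsets


def retrieve(query_values, inverted_index, position_indexes):
    """Return the final result set as a sorted list of document identifiers."""
    if not query_values:
        return []
    tokens = [token for token, _ in query_values]
    # Initial result set: documents that contain every participle.
    initial = set.intersection(*(set(inverted_index.get(t, ())) for t in tokens))
    anchor, offsets = position_relation(query_values)
    final = []
    for doc_id in sorted(initial):
        positions = position_indexes[doc_id]
        lookup = {token: set(positions.get(token, ())) for token, _ in offsets}
        # A document conforms if some occurrence of the anchor participle has
        # every other participle at the expected index value difference.
        for p in positions.get(anchor, ()):
            if all(p + off in lookup[token] for token, off in offsets):
                final.append(doc_id)
                break
    return final


inverted_index = {"detection": {"A", "B"}, "device": {"A", "B", "C"}}
position_indexes = {
    "A": {"detection": [10, 15, 25, 30], "device": [17, 21]},
    "B": {"detection": [4], "device": [9]},
    "C": {"device": [1]},
}
print(retrieve([("detection", 1), ("device", 3)], inverted_index, position_indexes))
# ['A']
```

Documents "A" and "B" both enter the initial result set because each contains both participles, but only "A" has "device" exactly two positions after an occurrence of "detection".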
In one embodiment, the particular participle in the participle set is any one participle in the participle set.
In one embodiment, the participle position index of each document in the initial result set is constructed using the participle position index construction method provided in any of the above embodiments.
For the specific limitations of the document retrieval device, reference may be made to the above limitations of the document retrieval method, which are not described herein again. The modules in the document retrieval device can be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor of the computer device in hardware form, or can be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, the internal structure of which may be as shown in FIG. 5.
The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as indexes; for the specific data stored, reference may be made to the limitations in the above method embodiments. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a word segmentation position index construction method.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
The present embodiment also provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps included in the method provided in any of the above method embodiments are implemented.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the steps of the method provided in any of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the above method embodiments can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (14)

1. A word segmentation position index construction method, characterized by comprising the following steps:
acquiring a target document of an index to be constructed;
performing word segmentation processing on the specific content of the target document to obtain a word segmentation set of the target document;
constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document; the word segmentation position index of the target document is used for recording an index value of each word segmentation in the word segmentation set of the target document, wherein the index value of each word segmentation is equal to the arrangement sequence value of a specific character in the word segmentation in specific content of the target document.
2. The method of claim 1, wherein the step of constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document comprises:
allocating a corresponding index value to each participle in the word segmentation set of the target document, and constructing a corresponding word segmentation position index for the target document according to the index value of each participle.
3. The method of claim 2, wherein the set of participles of the target document comprises a plurality of participles having a word count of 1 and a plurality of participles having a word count exceeding 1; the participles with the word number of 1 are each word in the specific content of the target document;
the allocating a corresponding index value to each participle in the participle set of the target document includes:
when a corresponding index value is allocated to each participle with the word number of 1, taking the arrangement sequence value of each participle in the specific content of the target document as the corresponding index value;
and when the corresponding index value is allocated to each participle with the word number exceeding 1, taking the arrangement sequence value of the specific word in each participle in the specific content of the target document as the corresponding index value.
4. The method of claim 1, wherein the specific word is a first word or a last word.
5. A method of document retrieval, the method comprising:
performing word segmentation processing on the obtained retrieval text to obtain a word segmentation set;
determining an index value of each participle in the participle set, wherein the index value of each participle in the participle set is equal to the arrangement sequence value of a specific character in the participle in the retrieval text;
determining the position relation of the participle set according to the index value of each participle, wherein the position relation of the participle set represents the index value difference between a specific participle in the participle set and each other participle;
and querying document indexes according to the word segmentation set to obtain an initial result set, and screening a final result set from the initial result set according to the word segmentation set, the position relation of the word segmentation set, and the word segmentation position index of each document in the initial result set.
6. The method of claim 5, wherein the participle set comprises n participles having a word count of 1 and m participles having a word count exceeding 1;
the determining an index value of each participle in the participle set includes:
taking the arrangement sequence value of each participle having a word count of 1 in the retrieval text as a corresponding index value;
and taking the arrangement sequence value of the specific character of each participle with the character number exceeding 1 in the retrieval text as a corresponding index value.
7. The method of claim 5 or 6, wherein the screening of a final result set from the initial result set according to the word segmentation set, the position relation of the word segmentation set, and the word segmentation position index of each document in the initial result set comprises:
querying the word segmentation position index of each document according to the word segmentation set to obtain word segmentation position information of each document;
checking whether each document conforms to the position relation of the word segmentation set according to the word segmentation position information of each document;
and screening out the documents that conform to the position relation of the word segmentation set from the initial result set as a final result set.
8. The method of claim 5, wherein the specific character in any one of the participles is a first character or a last character in the participle.
9. The method of claim 5, wherein the specific word is a first word or a last word.
10. The method of claim 5, wherein the participle position index for each document in the initial result set is constructed using the participle position index construction method of any one of claims 1-4.
11. A word segmentation position index construction device, characterized in that the device comprises:
the target document acquisition module is used for acquiring a target document of the index to be constructed;
the word segmentation module is used for carrying out word segmentation processing on the specific content of the target document to obtain a word segmentation set of the target document;
the index construction module is used for constructing a corresponding word segmentation position index for the target document according to the word segmentation set of the target document; the word segmentation position index of the target document is used for recording an index value of each word segmentation in the word segmentation set of the target document, wherein the index value of each word segmentation is equal to the arrangement sequence value of a specific character in the word segmentation in specific content of the target document.
12. A document retrieval apparatus, characterized in that the apparatus comprises:
the word segmentation module is used for carrying out word segmentation processing on the obtained retrieval text to obtain a word segmentation set;
an index value determining module, configured to determine an index value of each participle in the participle set, where the index value of each participle in the participle set is equal to an arrangement order value of a specific word in the participle in the search text;
the position relation determining module is used for determining the position relation of the participle set according to the index value of each participle, and the position relation of the participle set represents the index value difference between a specific participle in the participle set and each other participle;
and the retrieval module is used for querying document indexes according to the word segmentation set to obtain an initial result set, and screening a final result set from the initial result set according to the word segmentation set, the position relation of the word segmentation set and the word segmentation position index of each document in the initial result set.
13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 10 are implemented by the processor when executing the computer program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 10.
CN202210000597.4A 2022-01-04 2022-01-04 Word segmentation position index construction method and device, and document retrieval method and device Active CN114003685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210000597.4A CN114003685B (en) 2022-01-04 2022-01-04 Word segmentation position index construction method and device, and document retrieval method and device

Publications (2)

Publication Number Publication Date
CN114003685A true CN114003685A (en) 2022-02-01
CN114003685B CN114003685B (en) 2022-06-07

Family

ID=79932547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210000597.4A Active CN114003685B (en) 2022-01-04 2022-01-04 Word segmentation position index construction method and device, and document retrieval method and device

Country Status (1)

Country Link
CN (1) CN114003685B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115098617A (en) * 2022-06-10 2022-09-23 杭州未名信科科技有限公司 Method, device and equipment for labeling triple relation extraction task and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101620607A (en) * 2008-07-01 2010-01-06 全国组织机构代码管理中心 Full-text retrieval method and full-text retrieval system
CN102236697A (en) * 2010-04-27 2011-11-09 卡西欧计算机株式会社 Searching apparatus and searching method
CN102541960A (en) * 2010-12-31 2012-07-04 北大方正集团有限公司 Method and device of fuzzy retrieval
CN102567421A (en) * 2010-12-27 2012-07-11 北大方正集团有限公司 Document retrieval method and device
CN108804642A (en) * 2018-06-05 2018-11-13 中国平安人寿保险股份有限公司 Search method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114003685B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
JP7073576B2 (en) Association recommendation method, equipment, computer equipment and storage media
CN110321408B (en) Searching method and device based on knowledge graph, computer equipment and storage medium
CN109657137B (en) Public opinion news classification model construction method, device, computer equipment and storage medium
CN110674319A (en) Label determination method and device, computer equipment and storage medium
CN111177405A (en) Data search matching method and device, computer equipment and storage medium
CN112560444A (en) Text processing method and device, computer equipment and storage medium
CN111737981A (en) Vocabulary error correction method and device, computer equipment and storage medium
CN112685475A (en) Report query method and device, computer equipment and storage medium
CN114003685B (en) Word segmentation position index construction method and device, and document retrieval method and device
CN111382570B (en) Text entity recognition method, device, computer equipment and storage medium
CN110555165B (en) Information identification method and device, computer equipment and storage medium
CN114222000B (en) Information pushing method, device, computer equipment and storage medium
CN110825840A (en) Word bank expansion method, device, equipment and storage medium
CN111368061A (en) Short text filtering method, device, medium and computer equipment
CN111241811B (en) Method, apparatus, computer device and storage medium for determining search term weight
CN116303968A (en) Semantic search method, device, equipment and medium based on technical keyword extraction
CN110781310A (en) Target concept graph construction method and device, computer equipment and storage medium
CN110795617A (en) Error correction method and related device for search terms
CN115796176A (en) Word segmentation processing method, computer device, storage medium, and computer program product
CN114169331A (en) Address resolution method, device, computer equipment and storage medium
CN112650914A (en) Long-tail keyword identification method, keyword search method and computer equipment
EP1076305A1 (en) A phonetic method of retrieving and presenting electronic information from large information sources, an apparatus for performing the method, a computer-readable medium, and a computer program element
CN117743558B (en) Knowledge processing and knowledge question-answering method, device and medium based on large model
CN112559671B (en) ES-based text search engine construction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant