CN112883165B

CN112883165B - Intelligent full-text retrieval method and system based on semantic understanding

Info

Publication number: CN112883165B
Application number: CN202110281426.9A
Authority: CN
Inventors: 吴士伟; 杨春; 李慧娟; 孙露; 孙浩; 辛国茂; 胡传会
Original assignee: Shandong Ecloud Information Technology Co ltd
Current assignee: Shandong Ecloud Information Technology Co ltd
Priority date: 2021-03-16
Filing date: 2021-03-16
Publication date: 2022-12-02
Anticipated expiration: 2041-03-16
Also published as: CN112883165A

Abstract

The invention discloses an intelligent full-text retrieval method and system based on semantic understanding, which comprises the following steps: cutting the received search sentence into short texts, and performing word segmentation operation on the short texts to obtain word segmentation libraries corresponding to the short texts; constructing a semantic information vector and a dependency relationship vector of the short text; the semantic information vector comprises a central word and a word sense co-occurrence word of the short text; and based on the semantic information vector and the dependency relationship vector of the short text, performing similarity calculation on the short text information and the related information in the intelligent index library to further obtain a search result set. According to the method, the original data are divided into a plurality of short texts to form the search text vector, and the similarity between the search text and the index library text is calculated by calling the semantic understanding interface of the artificial intelligence platform, so that the accuracy of full-text retrieval can be improved.

Description

Intelligent full-text retrieval method and system based on semantic understanding

Technical Field

The invention relates to the technical field of natural language processing, in particular to an intelligent full-text retrieval method and system based on semantic understanding.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Full-text search takes various kinds of data, such as text, speech and images, as its processing objects and provides a means of retrieving information according to the content of the data rather than its external characteristics. It comprises two functions, data management and data query, which help users manage and retrieve large amounts of document data quickly.

Lucene is an open source project of Apache and is currently the most popular Java-based open source full-text search toolkit. Lucene implements some general word segmentation algorithms, reserves a number of lexical analyzer interfaces for users, and can be conveniently embedded into various applications to give them a full-text retrieval function. In essence, Lucene retrieval is still index retrieval: a full-text index is built over the files and text to be retrieved; at query time the index is searched quickly to obtain retrieval positions, each position is associated with the path of a document in which the search term appears, and Lucene returns the retrieval result to the user.

In the big data era the volume of data has increased sharply, and with the development of microblogs, forums, Douyin and other social media, many new words have appeared, so the retrieval effect needs to be improved. The reason is that conventional full-text retrieval segments the original data into words and links each keyword to all documents containing it by means of an inverted index; when a user searches, it quickly finds and returns the documents containing the searched keywords. The documents are therefore only matched mechanically by their literal form, and much information that expresses the same concept in different words is missed, that is, the keywords are not understood semantically. For example, a user who searches for "cities where all four seasons are like spring" wants to obtain cities such as Kunming, Xiamen, mingming and the like, but conventional full-text search only matches articles containing keywords such as "four seasons" and "city", which hardly meets the user's real need.
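The keyword-matching limitation described above can be illustrated with a minimal inverted index sketch. The documents, the whitespace tokenizer and the function names are assumptions made for illustration only; they are not taken from Lucene or from the invention.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def keyword_search(index, query):
    """Return IDs of documents that contain every query token literally."""
    token_sets = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

# Hypothetical documents: d1 is what the user wants, d2 merely repeats the keywords.
docs = {
    "d1": "kunming has mild weather all year",
    "d2": "a city with four seasons like spring",
}
index = build_inverted_index(docs)
hits = keyword_search(index, "city four seasons spring")
```

Here only the document that literally repeats the query words is returned, while the document about Kunming is missed because it shares no token with the query.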

In addition, most search inputs for full-text retrieval are short texts. Because of the particular nature of short texts, classifying short text information differs from classifying traditional long texts, and scholars have carried out a series of studies on problems of short texts such as data sparsity, excessively high dimensionality and insufficient semantic information. The prior art has applied deep neural network (DNN) methods to short text classification with some effect, but challenges remain: most short text classification models only consider the literal meaning, recognize ubiquitous polysemous words poorly, and cannot overcome the sparsity of short texts.

Disclosure of Invention

In order to solve the problems, the invention provides an intelligent full-text retrieval method and system based on semantic understanding.

In some embodiments, the following technical scheme is adopted:

an intelligent full-text retrieval method based on semantic understanding comprises the following steps:

cutting the received search sentence into short texts, and performing word segmentation operation on the short texts to obtain word segmentation libraries corresponding to the short texts;

constructing a semantic information vector and a dependency relationship vector of the short text; the semantic information vector comprises a central word and a word sense co-occurrence word of the short text;

and based on the semantic information vector and the dependency relationship vector of the short text, performing similarity calculation on the short text information and related information in the intelligent index library to further obtain a search result set.

As a further scheme, obtaining a word segmentation library corresponding to the short text specifically includes:

matching periods or question marks through regular expressions, and cutting the long text into short texts; and performing word segmentation operation on the short text by combining the stop word bank to form a word segmentation bank corresponding to the short text.

As a further scheme, the name of the short text includes the belonging long text flag.

As a further scheme, constructing a semantic information vector and a word dependency relationship vector of a short text specifically includes:

extracting a central word and word sense co-occurrence words of the short text; the central word, the word sense co-occurrence words and the attributes of the co-occurrence words together form the semantic information vector of the short text;

and obtaining a syntactic dependency relationship tree through syntactic dependency analysis based on the central word and the semantic co-occurrence word in the semantic information vector library to form a word dependency relationship vector of the short text.

As a further scheme, the intelligent index library comprises: the short text, the word segmentation library corresponding to the short text, the semantic information vector and the dependency relationship vector of the short text.

As a further scheme, the similarity calculation of the searched short text information and the related information in the intelligent index library specifically includes:

calculating the similarity between the searched short text and each short text central word in the intelligent index library;

counting the semantic dependency relations that the dependency relationship tree of the searched short text has in common with the dependency relationship tree of each short text in the intelligent index library;

calculating the similarity of the core words corresponding to the same semantic dependency relationship in the short text and each short text dependency relationship tree in the intelligent index library;

and extracting, from the searched short text and from each short text in the intelligent index library, one word from among the words having more word sense co-occurrence words, and calculating the similarity between the extracted words.

As a further scheme, the similarity scores obtained by calculation are added, the results are sorted from large to small according to the total score, and the search result set is returned.

In other embodiments, the following technical solutions are adopted:

an intelligent full-text retrieval system based on semantic understanding, comprising:

the data preprocessing module is used for cutting the received search sentences into short texts and performing word segmentation operation on the short texts to obtain word segmentation libraries corresponding to the short texts;

the short text vector construction module is used for constructing a semantic information vector and a dependency relationship vector of the short text; the semantic information vector comprises a central word and a word sense co-occurrence word of the short text;

and the data indexing module is used for carrying out similarity calculation on the short text information and the related information in the intelligent index library based on the semantic information vector and the dependency relationship vector of the short text so as to obtain a search result set.

In other embodiments, the following technical solutions are adopted:

a terminal device comprising a processor and a memory, the processor being arranged to implement instructions; the memory is used for storing a plurality of instructions which are suitable for being loaded by the processor and executing the intelligent full-text retrieval method based on semantic understanding.

In other embodiments, the following technical solutions are adopted:

a computer-readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and to execute the above intelligent full-text retrieval method based on semantic understanding.

Compared with the prior art, the invention has the beneficial effects that:

(1) The word sense co-occurrence words refer to words with similar word senses in the short text, and the extraction of the word sense co-occurrence words can extract the semantics of the short text more quickly and accurately.

(2) The method analyzes syntactic dependency relationships, extracts core words through the syntactic dependency tree and calculates the similarity of the core words, which helps to judge the similarity of short texts; the number of identical dependency relationships is additionally used to assist in calculating the similarity between short texts, thereby improving the accuracy of full-text retrieval.

Additional features and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

FIG. 1 is a flow chart of an intelligent full-text retrieval method based on semantic understanding in an embodiment of the present invention;

FIG. 2 is a diagram illustrating a dependency syntax structure according to an embodiment of the present invention.

Detailed Description

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.

Example one

In one or more embodiments, an intelligent full-text retrieval method based on semantic understanding is disclosed, and with reference to fig. 1, the method comprises the following processes:

(1) Cutting the received search sentence into short texts, and performing word segmentation operation on the short texts to obtain word segmentation libraries corresponding to the short texts;

(2) Constructing a semantic information vector and a dependency relationship vector of the short text; the semantic information vector comprises a central word and a word sense co-occurrence word of the short text;

(3) And based on the semantic information vector and the dependency relationship vector of the short text, performing similarity calculation on the short text information and related information in the intelligent index library to further obtain a search result set.

Specifically, the process of constructing the intelligent index library in this embodiment specifically includes:

The original data is stored in a MongoDB database. Most of the original data is long text; because long text sentences are too long and semantically complex, semantic features are difficult to extract, so the long text is processed into short texts in this embodiment.

The method specifically comprises the following steps:

First, periods are matched through regular expressions and the long text is cut into short texts; then the stop word library is manually enriched; finally, the ltp_data tool of the HIT (Harbin Institute of Technology) Language Cloud is applied, in combination with the stop word library, to perform word segmentation on the short texts and form the word segmentation library corresponding to each short text. The name of each short text includes the mark of the long text to which it belongs, so as to identify the provenance of the short text.
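The cutting and naming step can be sketched as follows. This is a minimal sketch under stated assumptions: the `<long_id>_<n>` naming format, the function name and the sample text are hypothetical, and both ASCII and full-width periods and question marks are treated as sentence terminators.

```python
import re

# Sentence terminators: ASCII and full-width periods and question marks.
_SENTENCE_END = re.compile(r"[.?。？]")

def split_long_text(long_id, long_text):
    """Cut a long text into short texts named '<long_id>_<n>' (assumed format)."""
    pieces = [p.strip() for p in _SENTENCE_END.split(long_text)]
    pieces = [p for p in pieces if p]  # drop empty fragments
    return {f"{long_id}_{n}": p for n, p in enumerate(pieces, start=1)}

short_texts = split_long_text("doc42", "Kunming is mild all year. Is Xiamen warm? Yes.")
# short_texts maps "doc42_1", "doc42_2" and "doc42_3" to the three sentences
```

Splitting on a bare "." also cuts decimal numbers and abbreviations, so a production rule set would likely need to be stricter than this pattern.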

Stop words are characters or words that are automatically filtered out before or after processing natural language data (or text) in information retrieval, in order to save storage space and improve search efficiency. The stop words are entered manually rather than generated automatically, and together they form a stop word list, namely the stop word library.
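Stop word filtering during segmentation might look like the sketch below. The whitespace tokenizer is only a stand-in for a real word segmenter, and the stop word entries and the function name are assumptions.

```python
def segment(short_text, stop_words):
    """Tokenize on whitespace (stand-in for a real segmenter) and drop stop words."""
    tokens = short_text.lower().split()
    return [t for t in tokens if t not in stop_words]

# Manually maintained stop word library; the entries are hypothetical.
STOP_WORDS = {"the", "a", "is", "of", "in"}

tokens = segment("Kunming is the capital of Yunnan", STOP_WORDS)
# tokens == ["kunming", "capital", "yunnan"]
```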

For example, one piece of raw data in the MongoDB database is as follows:

The above raw data is processed into short texts as follows:

The first half of the ID is the ID of the long text, so the long text corresponding to each short text can be identified.

After the short texts and their word segmentation libraries are obtained, semantic processing still needs to be performed on the short texts, because the word segmentation library alone cannot realize semantic understanding; this processing includes semantic information vector construction and dependency relationship vector construction.

The construction of the semantic information vector comprises the following steps: extracting the central word of the short text through a natural language processing interface; extracting the word sense co-occurrence words of the short text through a natural language processing interface; the central word, the word sense co-occurrence words and the co-occurrence word attributes jointly form the semantic information vector of the short text. For example:
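One possible in-memory shape for such a vector is sketched below. The class name, the field names, the sample words and the attribute labels are all assumptions, and the natural language processing interface that would supply the values is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticInfoVector:
    """Central word plus word sense co-occurrence words and their attributes."""
    central_word: str
    cooccurrence_words: list = field(default_factory=list)
    # One attribute per co-occurrence word, e.g. a part-of-speech label (assumed).
    cooccurrence_attrs: dict = field(default_factory=dict)

# Hypothetical values that an NLP interface might return for one short text.
vec = SemanticInfoVector(
    central_word="city",
    cooccurrence_words=["town", "metropolis"],
    cooccurrence_attrs={"town": "noun", "metropolis": "noun"},
)
```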

The dependency relationship vector is constructed based on the central word and the word sense co-occurrence words in the semantic information vector library: a syntactic dependency analysis interface is called, and the resulting syntactic dependency tree forms the word dependency relationship vector of the short text.

Referring to fig. 2, dependency parsing analyzes a sentence into a dependency syntax tree that describes the dependency relationships between words. In this predicate-centered grammatical structure, the verb is the central word of the sentence; all other components of the sentence are governed by the central verb, and each governed component depends on its governor through a certain dependency relationship. For example:
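A dependency tree of this kind can be held as a list of (head, relation, dependent) arcs. The sketch below uses hand-written arcs for a made-up sentence; the sentence, the relation labels and the helper names are illustrative assumptions rather than output of any particular parser.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    head: str       # governing word ("ROOT" for the sentence root)
    relation: str   # dependency label
    dependent: str  # governed word

# Hand-written arcs for "Kunming enjoys mild weather"; labels are illustrative.
arcs = [
    Arc("ROOT", "HED", "enjoys"),
    Arc("enjoys", "SBV", "Kunming"),
    Arc("enjoys", "VOB", "weather"),
    Arc("weather", "ATT", "mild"),
]

def core_word(arcs):
    """Return the word attached to ROOT, i.e. the central verb of the sentence."""
    for arc in arcs:
        if arc.head == "ROOT":
            return arc.dependent
    return None

def governed_by(arcs, word):
    """Return (relation, dependent) pairs directly governed by `word`."""
    return [(a.relation, a.dependent) for a in arcs if a.head == word]
```

With these arcs, `core_word(arcs)` yields the central verb and `governed_by(arcs, "enjoys")` lists the subject and object that depend on it.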

Referring to fig. 1, the contents of the intelligent index library include: the short text, the word segmentation library corresponding to the short text, and the semantic information vector and the dependency relationship vector of the short text.
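As a rough sketch, one index library entry could bundle these four items in a plain dictionary keyed by the short text ID. The key names, the ID format and the sample values are assumptions; the storage actually used is not specified here.

```python
def make_index_entry(short_text, tokens, semantic_vector, dependency_vector):
    """Bundle the four stored items for one short text into a plain dict."""
    return {
        "short_text": short_text,
        "tokens": tokens,
        "semantic_vector": semantic_vector,
        "dependency_vector": dependency_vector,
    }

# In-memory stand-in for the intelligent index library, keyed by short text ID.
index_library = {}
index_library["doc42_1"] = make_index_entry(
    "Kunming enjoys mild weather",
    ["kunming", "enjoys", "mild", "weather"],
    {"central_word": "enjoys", "cooccurrence_words": []},
    [("ROOT", "HED", "enjoys"), ("enjoys", "SBV", "Kunming")],
)
```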

After the intelligent index library is constructed, for a text needing to be searched, cutting a received search sentence into a short text, and performing word segmentation operation on the short text to obtain a word segmentation library corresponding to the short text; then respectively constructing semantic information vectors and dependency relationship vectors of the short texts; finally, a short text with complete semantics of the search text is obtained as follows:

Then an artificial intelligence text similarity algorithm model is called to calculate the similarity between the obtained search text with complete semantics and the intelligent index library.

The similarity calculation includes four aspects of calculation:

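calculating the similarity between the central word of the searched short text and the central word of each short text in the intelligent index library;

counting the semantic dependency relations that the dependency relationship tree of the searched short text has in common with the dependency relationship tree of each short text in the intelligent index library;

calculating the similarity of the core words corresponding to the same semantic dependency relations in the searched short text and in each short text dependency relationship tree in the intelligent index library;

and extracting, from the searched short text and from each short text in the intelligent index library, one word from among the words having more word sense co-occurrence words, and calculating the similarity between the extracted words.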
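The count of shared dependency relations can be sketched as a multiset intersection over relation labels. Treating two relations as "the same" when their labels match is an assumption of this sketch, as are the function name and the sample arcs.

```python
from collections import Counter

def shared_relation_count(query_arcs, indexed_arcs):
    """Count dependency labels common to both trees, respecting multiplicity."""
    q = Counter(rel for _, rel, _ in query_arcs)
    d = Counter(rel for _, rel, _ in indexed_arcs)
    return sum((q & d).values())  # '&' keeps the minimum count per label

# Hypothetical (head, relation, dependent) arcs for a query and an indexed text.
query_arcs = [("ROOT", "HED", "has"), ("has", "SBV", "city"), ("has", "VOB", "spring")]
indexed_arcs = [("ROOT", "HED", "enjoys"), ("enjoys", "SBV", "Kunming"),
                ("enjoys", "VOB", "weather"), ("weather", "ATT", "mild")]

count = shared_relation_count(query_arcs, indexed_arcs)  # 3 shared labels
```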

Each of the four similarity calculations yields a similarity score; the four scores are added, the short texts are sorted by total score from high to low, and a short text result set is returned. For example, the returned results are as follows:
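The summing and ranking step can be sketched as follows. The short text IDs and the score values are made up for illustration, and the four per-aspect scores are assumed to be directly addable without weighting or normalization.

```python
def rank_short_texts(scores_by_id):
    """Sum the four per-aspect scores and sort IDs by total, highest first."""
    totals = {sid: sum(scores) for sid, scores in scores_by_id.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical per-aspect scores: (central word, shared relations,
# core words of shared relations, co-occurrence word).
scores_by_id = {
    "doc42_1": (0.5, 2, 0.25, 0.25),
    "doc7_3": (0.25, 1, 0.5, 0.0),
    "doc42_2": (0.75, 3, 0.5, 0.5),
}

ranked = rank_short_texts(scores_by_id)
# ranked[0] == ("doc42_2", 4.75)
```

Because the relation count is an integer while the other scores are similarities, an unweighted sum lets the count dominate; whether any weighting is applied is not stated in the text.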

The value in the frame is the unique identifier of the corresponding long text; the original data is retrieved according to this unique identifier and returned to the user.
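Mapping ranked short texts back to their long texts can be sketched as below, assuming the hypothetical `<long_id>_<n>` naming in which the part before the last underscore identifies the long text. Fetching the original record from the database is left out.

```python
def long_text_ids(ranked_short_ids):
    """Map ranked short-text IDs to unique long-text IDs, preserving rank order."""
    seen = []
    for sid in ranked_short_ids:
        long_id = sid.rsplit("_", 1)[0]  # assumed '<long_id>_<n>' naming
        if long_id not in seen:
            seen.append(long_id)
    return seen

ids = long_text_ids(["doc42_2", "doc42_1", "doc7_3"])
# ids == ["doc42", "doc7"]
```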

Example two

In one or more embodiments, disclosed is a semantic understanding-based intelligent full-text retrieval system, comprising:

The specific implementation of each module is described in detail in the first embodiment, and is not described herein again.

Example three

In one or more implementations, a terminal device is disclosed, which includes a server including a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the intelligent full-text retrieval method based on semantic understanding in the first embodiment. For brevity, further description is omitted herein.

It should be understood that in this embodiment, the processor may be a central processing unit (CPU), and the processor may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor or any conventional processor or the like.

The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.

In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software.

The intelligent full-text retrieval method based on semantic understanding in the first embodiment can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor. To avoid repetition, it is not described in detail here.

Those of ordinary skill in the art will appreciate that the various illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

Example four

In one or more implementations, a computer-readable storage medium having stored therein a plurality of instructions adapted to be loaded by a processor of a terminal device and implementing a semantic understanding-based intelligent full-text retrieval method as described in the first embodiment is disclosed.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims

1. An intelligent full-text retrieval method based on semantic understanding is characterized by comprising the following steps:

the word sense co-occurrence words refer to words with similar word senses in the short text; extracting word sense co-occurrence words of the short text through a natural language processing interface;

based on the semantic information vector and the dependency relationship vector of the short text, similarity calculation is carried out on the short text information and relevant information in an intelligent index library, and then a search result set is obtained;

and performing similarity calculation on the short text information and the related information in the intelligent index library, wherein the similarity calculation comprises calculating similarity of searching the short text and extracting one word from words with more total word sense co-occurrence words in each short text in the intelligent index library.

2. The intelligent full-text retrieval method based on semantic understanding of claim 1, wherein obtaining the word segmentation library corresponding to the short text specifically comprises:

matching periods or question marks through regular expressions, and cutting the long text into short texts; and performing word segmentation operation on the short text by combining the stop word bank to form a word segmentation bank corresponding to the short text.

3. The intelligent full-text retrieval method based on semantic understanding of claim 2, wherein the naming of the short text comprises the belonging long text mark.

4. The intelligent full-text retrieval method based on semantic understanding as claimed in claim 1, wherein constructing semantic information vectors and word dependency relationship vectors of short texts specifically comprises:

5. The intelligent full-text retrieval method based on semantic understanding according to claim 1, wherein the intelligent index library comprises: the short text, the word segmentation library corresponding to the short text, the semantic information vector and the dependency relationship vector of the short text.

6. The intelligent full-text retrieval method based on semantic understanding according to claim 1, wherein the similarity calculation is performed between the searched short text information and the related information in the intelligent index library, and further comprising:

and calculating the similarity of the core words corresponding to the same semantic dependency relationship in the short text and each short text dependency relationship tree in the intelligent index library.

7. The intelligent full-text retrieval method based on semantic understanding of claim 6, wherein the similarity scores obtained by calculation are added, and are sorted from large to small according to the total score after addition, and a search result set is returned.

8. An intelligent full-text retrieval system based on semantic understanding, comprising:

the short text vector construction module is used for constructing a semantic information vector and a dependency relationship vector of a short text; the semantic information vector comprises a central word and a word sense co-occurrence word of the short text;

the data index module is used for carrying out similarity calculation on the short text information and the related information in the intelligent index library based on the semantic information vector and the dependency relationship vector of the short text so as to obtain a search result set;

and performing similarity calculation on the short text information and the related information in the intelligent index library, wherein the similarity calculation comprises calculating the similarity of the short text information and words extracted from the words with more word sense co-occurrence words in each short text in the intelligent index library.

9. A terminal device comprising a processor and a memory, the processor being arranged to implement instructions; the memory is used for storing a plurality of instructions, wherein the instructions are suitable for being loaded by the processor and executing the intelligent full text retrieval method based on semantic understanding according to any one of claims 1-7.

10. A computer-readable storage medium having stored thereon a plurality of instructions, wherein the instructions are adapted to be loaded by a processor of a terminal device and to perform the intelligent full text retrieval method based on semantic understanding according to any one of claims 1 to 7.