CN112527954A - Unstructured data full-text search method and system and computer equipment

Info

Publication number: CN112527954A
Application number: CN202011398749.8A
Authority: CN
Inventors: 高静; 谢国栋; 庄之中
Original assignee: Wuhan United Imaging Healthcare Co Ltd
Current assignee: Wuhan United Imaging Healthcare Co Ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2021-03-19

Abstract

The application relates to a method, a system, and computer equipment for full-text search of unstructured data. The method comprises: parsing an unstructured file into semi-structured information; vectorizing the semi-structured information and the key information of a full-text search to obtain a vectorization result; and performing recall according to the vectorization result to determine a target full-text search result. With this method, there is no need to open the unstructured file, to convert it into a structured file, or to carry out the complex process of first obtaining the index information corresponding to the desired target text information and then extracting that text from the unstructured file according to the index information. Instead, the unstructured file is parsed into a semi-structured file and the target result is obtained directly through intelligent search, which reduces the operational complexity of full-text search over unstructured data and improves search efficiency.

Description

Unstructured data full-text search method and system and computer equipment

Technical Field

The present application relates to the field of information search technologies, and in particular, to a method, a system, and a computer device for full text search of unstructured data.

Background

At present, vertical search is widely used on websites such as portals and e-commerce platforms. These sites mainly rely on full-text search of structured data to provide users with more vertical and intuitive search services, so that users can quickly and accurately find the information they need on a given website. Enterprises, public institutions, and government agencies, however, hold large numbers of unstructured documents (such as pdf, doc, and ppt documents), so a complete solution for full-text search of unstructured data is needed.

In the conventional technology, unstructured files are stored on the web page side, and full-text search of unstructured data is carried out through preview search or through the search function of the software program for the corresponding format: index information corresponding to the desired target text information is obtained first, and the target text information is then extracted from the unstructured file using that index information. However, this conventional approach is complex to operate, and its search efficiency for unstructured data is low.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a method, a system, and a computer device for full-text search of unstructured data that can improve the efficiency of searching unstructured data.

A method of full-text searching of unstructured data, the method comprising:

analyzing the unstructured file into semi-structured information;

vectorizing the semi-structured information and the key information of full-text search to obtain a vectorization result;

and recalling according to the vectorization result to determine a target full-text search result.

In one embodiment, the vectorization result includes vectorization unit information and vectorization key information;

the vectorizing processing is performed on the semi-structured information and the key information of the full-text search to obtain a vectorized result, and the vectorizing processing comprises the following steps:

vectorizing the semi-structured information to obtain vectorized unit information;

and vectorizing the key information of the full-text search to obtain the vectorized key information.

In one embodiment, the vectorizing the semi-structured information to obtain vectorized unit information includes:

extracting different unit information in the semi-structured information;

and performing vectorization processing according to different unit information to obtain the vectorization unit information.

In one embodiment, the performing vectorization processing according to different unit information to obtain the vectorized unit information includes:

preprocessing different unit information to obtain preprocessed different unit information;

and vectorizing the different preprocessed unit information to obtain the vectorized unit information.

In one embodiment, the method further comprises:

receiving a full-text search instruction, wherein the full-text search instruction carries key information of full-text search;

responding to the full text search instruction.

In one embodiment, the recalling according to the vectorization result to determine a target full-text search result includes:

similarity processing is carried out on the vectorization unit information and the vectorization key information, and an initial full text search result is obtained;

and sequencing the initial full-text search results to determine target full-text search results.

In one embodiment, the sorting the initial full-text search result and determining the target full-text search result includes:

and sequencing the initial full-text search result according to target search information to obtain the target full-text search result.

In one embodiment, the method further comprises: and storing the vectorization unit information to a full-text search engine.

An unstructured data full-text search system, the system comprising:

the file analysis module is used for analyzing the unstructured file into semi-structured information;

the vectorization module is used for vectorizing the semi-structured information and the key information of full-text search to obtain a vectorization result;

and the recall module is used for recalling according to the vectorization result and determining a target full-text search result.

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

analyzing the unstructured file into semi-structured information;

A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:

analyzing the unstructured file into semi-structured information;

The method parses the unstructured file into semi-structured information, vectorizes the semi-structured information and the key information of the full-text search to obtain a vectorization result, and performs recall according to the vectorization result to determine the target full-text search result. With this method, there is no need to open the unstructured file, to convert it into a structured file, or to carry out the complex process of first obtaining the index information corresponding to the desired target text information and then extracting that text from the unstructured file according to the index information. Instead, the unstructured file is parsed into a semi-structured file and the target result is obtained directly through intelligent search, which reduces the operational complexity of full-text search over unstructured data and improves search efficiency.

Drawings

FIG. 1 is a flow diagram of a method for full-text search of unstructured data in an embodiment;

FIG. 2 is a flow diagram of a vectorization process in another embodiment;

FIG. 3 is a flowchart illustrating recall processing in accordance with an alternative embodiment;

FIG. 4 is a diagram illustrating a response result of a full-text search instruction in accordance with another embodiment;

FIG. 5 is a diagram illustrating text content in html format displayed in a patient management interface file in accordance with another embodiment;

FIG. 6 is a block diagram that illustrates an unstructured data full-text search system, in accordance with an embodiment;

FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The unstructured data full-text searching method provided by the embodiment can be applied to computer equipment. Alternatively, the unstructured data full text search may be understood as a process of searching for a certain content in an unstructured file without opening the unstructured file. The computer device may be an electronic device with an information processing function, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, or a personal digital assistant, and the specific form of the computer device is not limited in this embodiment.

It should be noted that the application environment of the unstructured-data full-text search method provided in the embodiment of the present application may be an unstructured-data full-text search system, and the system may be implemented as part of or all of a computer device by software, hardware, or a combination of software and hardware. The execution subjects of the method embodiments described below are described taking a computer device as an example. In this embodiment, the computer device may install a Document2text plug-in, and implement the above-mentioned unstructured data full-text search method through the Document2text plug-in; the Document2text plug-in may be a custom function plug-in.

Fig. 1 is a flowchart illustrating a full-text search method for unstructured data according to an embodiment. The embodiment relates to a process for implementing full-text search on unstructured text, and is described by taking an example that the method is applied to computer equipment. As shown in fig. 1, the method includes:

S1000, parsing the unstructured file into semi-structured information.

Specifically, the computer device may first identify the text file type of each locally stored unstructured file, and then parse the unstructured file into semi-structured information using the Document2html algorithm. Optionally, the text file type of the unstructured file may be understood as the format of the unstructured file, i.e., pdf type, doc type, ppt type, and the like. Optionally, the Document2html algorithm may be understood as a text protocol corresponding to the text file type, where the text protocol may be a pdf protocol, a doc protocol, a ppt protocol, and the like. For example, the text protocol corresponding to a pdf-type unstructured file may be the pdf protocol, the text protocol corresponding to a doc-type unstructured file may be the doc protocol, and the text protocol corresponding to a ppt-type unstructured file may be the ppt protocol.

The semi-structured information may be text information in html format, that is, web page format. Such information can be previewed on the web page side and supports page jumps, so that during a full-text search the target text information can be located quickly and obtained directly. It also makes it easy to extract structured data from the semi-structured information and to collect material content for the full-text search system, and it supports subsequent recommendation systems, manual content recommendation, message pushing, and update reminders, so its operability is high. Structured information, by contrast, is table-type text information: it cannot be previewed, and a search over it returns only the index information corresponding to the target text information rather than the target text information itself, so its operability is low.

In this embodiment, the computer device may parse the unstructured file into semi-structured information organized by chapters and sections; that is, the layout structure of the semi-structured information after parsing is the same as the layout structure of the text information in the unstructured file before parsing. Further, the semi-structured information obtained after parsing can be stored in an html server for use in the next full-text search.

In addition, when parsing the doc-type and ppt-type files in the unstructured file into semi-structured information, the doc-type and ppt-type files may be converted into pdf-type files.
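
By way of illustration only, the type identification and html rendering described for S1000 might be sketched as follows. The application does not disclose the internals of the Document2html algorithm, so the extension-based detection, the `Section` structure, and all function names here are assumptions, and the actual extraction of text from pdf, doc, or ppt files is deliberately left out.

```python
from dataclasses import dataclass
from html import escape
from pathlib import Path

# Hypothetical mapping from file extension to the "text protocol" used for parsing.
PROTOCOLS = {".pdf": "pdf", ".doc": "doc", ".ppt": "ppt"}


@dataclass
class Section:
    """One chapter or section recovered from the unstructured file."""
    title: str
    body: str


def detect_protocol(path: str) -> str:
    """Identify the text file type from the extension (an assumed heuristic)."""
    suffix = Path(path).suffix.lower()
    if suffix not in PROTOCOLS:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return PROTOCOLS[suffix]


def needs_pdf_conversion(protocol: str) -> bool:
    """doc-type and ppt-type files may be converted into pdf-type files."""
    return protocol in {"doc", "ppt"}


def sections_to_html(doc_title: str, sections: list[Section]) -> str:
    """Render sections as html, keeping the original chapter order."""
    parts = [f"<html><head><title>{escape(doc_title)}</title></head><body>"]
    for section in sections:
        parts.append(f"<h2>{escape(section.title)}</h2>")
        parts.append(f"<p>{escape(section.body)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Keeping one heading element per section is one simple way to preserve the chapter-and-section layout of the source file in the html output.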

S2000, vectorizing the semi-structured information and the key information of the full-text search to obtain a vectorized result.

Specifically, the computer device may perform vectorization processing on the semi-structured information and the key information of the full-text search. Optionally, the vectorization processing may be understood as a process of converting text information into binarized information. Optionally, the key information of the full-text search may be keywords and/or key phrases in the content to be searched.

S3000, recalling according to the vectorization result, and determining a target full-text search result.

Specifically, the computer device may perform recall processing according to the obtained vectorization result to obtain a target full-text search result. Optionally, the recall processing may be understood as finding, through similarity calculation, the vectorization results that correspond to content similar to the target full-text search result.

In the above unstructured data full-text search method, the unstructured file is parsed into semi-structured information, the semi-structured information and the key information of the full-text search are vectorized to obtain a vectorization result, and recall processing is performed according to the vectorization result to determine the target full-text search result. With this method, there is no need to open the unstructured file, to convert it into a structured file, or to carry out the complex process of first obtaining the index information corresponding to the desired target text information and then extracting that text from the unstructured file according to the index information. Instead, the unstructured file is parsed into a semi-structured file and the target full-text search result is obtained directly through intelligent search, which reduces the operational complexity of the method and improves the search efficiency for unstructured data.

As an embodiment, the vectorization result includes vectorization unit information and vectorization key information, and as shown in fig. 2, the step of performing vectorization processing on the semi-structured information and the key information of full-text search in S2000 to obtain the vectorization result may be implemented by the following steps:

S2100, vectorizing the semi-structured information to obtain the vectorized unit information.

Specifically, the computer device may perform vectorization processing on all the converted semi-structured information to obtain vectorized unit information. Optionally, the semi-structured information corresponding to the unstructured document may be multi-page semi-structured information, and each page of semi-structured information corresponds to text information of a corresponding page in the unstructured document before analysis; each page of semi-structured information may include at least one of title content, text content under title, and summary content.

In S2100, the step of performing vectorization processing on the semi-structured information to obtain the vectorized unit information may specifically include: extracting different unit information in the semi-structured information; and vectorizing different unit information to obtain the vectorized unit information.

It should be noted that the computer device may extract different unit information in each page of semi-structured information by using html2text algorithm; the different unit information may be title content, text content under the title, and/or abstract content in the semi-structured information. That is, the computer device may extract all content in each page of semi-structured information, including title content, textual content under title, and/or abstract content.
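
By way of illustration only, the extraction of unit information from one page of html might be sketched as follows. The html2text algorithm itself is not specified in the application; this sketch instead uses the `html.parser` module from the Python standard library, and the class name, the choice of heading tags, and the `{"title", "text"}` unit layout are assumptions.

```python
from html.parser import HTMLParser


class UnitExtractor(HTMLParser):
    """Collects units, each a title plus the text content under that title."""

    HEADINGS = {"h1", "h2", "h3"}  # assumed heading tags

    def __init__(self):
        super().__init__()
        self.units = []            # list of {"title": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # A heading opens a new unit.
            self._in_heading = True
            self.units.append({"title": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.units:
            # Text that appears before any heading goes into an untitled unit.
            self.units.append({"title": "", "text": ""})
        key = "title" if self._in_heading else "text"
        unit = self.units[-1]
        unit[key] = (unit[key] + " " + text).strip()


def extract_units(page_html):
    """Return the units of one page in the order they appear."""
    parser = UnitExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.units
```

Because the units are appended in document order, they can later be vectorized in the same sequence in which they were extracted.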

In addition, the computer device may perform vectorization processing on the extracted different unit information according to the sequence of extracting the different unit information, so as to obtain the vectorized unit information.

The vectorizing processing according to the different unit information to obtain the vectorized unit information may specifically include: preprocessing different unit information to obtain preprocessed different unit information; and vectorizing the preprocessed different unit information to obtain the vectorized unit information.

In an embodiment, the computer device may preprocess the different unit information to obtain preprocessed unit information, and then vectorize the preprocessed unit information using a vectorization algorithm in a chapter2vec vectorization model. Optionally, the preprocessing may be understood as filtering out redundant punctuation marks and redundant text content from the different unit information: filtering out redundant punctuation marks may be understood as removing all punctuation marks, and filtering out redundant text content may be understood as removing prepositions. In addition, the vectorization algorithm may be the TF-IDF algorithm, the BM25 algorithm, the word2vec algorithm, the fastText algorithm, or the like.
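
By way of illustration only, the preprocessing and one of the named vectorization algorithms (TF-IDF) might be sketched as follows. The chapter2vec model is not reproduced here; the short English preposition list, the whitespace tokenization, and the smoothed inverse document frequency formula are assumptions chosen for the sketch.

```python
import math
import re
from collections import Counter

# A small illustrative preposition list; a real system would use a fuller one.
PREPOSITIONS = {"of", "in", "on", "at", "to", "for", "with", "by", "from"}


def preprocess(text):
    """Lowercase, remove all punctuation marks, then remove prepositions."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in PREPOSITIONS]


def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one TF-IDF vector per tokenized unit."""
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: in how many units each term appears.
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed idf (an assumed variant) so that no term gets a zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors
```

For the key information to be comparable with the unit information, it would need to be vectorized against the same vocabulary and the same idf weights.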

Further, after the execution of S2100, the method may further include: and storing the vectorization unit information to a full-text search engine.

In this embodiment, the computer device may store the vectorized unit information in the full-text search engine for use in full-text search of unstructured data by the full-text search engine. Optionally, the full-text search engine may be a distributed, multi-user-capable full-text search engine, a high-performance full-text search engine, or the like, such as Elasticsearch, RediSearch, Solr, or Faiss.
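
By way of illustration only, the storing step might be sketched with an in-memory stand-in for the full-text search engine. This is not the API of any of the engines named above; the class, its methods, and the stored fields (source file and page) are hypothetical and only show what information would need to be kept alongside each vector.

```python
class InMemoryVectorIndex:
    """Hypothetical stand-in for a search engine: unit id -> vector and origin."""

    def __init__(self):
        self._entries = {}

    def store(self, unit_id, vector, source_file, page):
        """Keep the vector together with the file and page it came from."""
        self._entries[unit_id] = {
            "vector": list(vector),
            "file": source_file,
            "page": page,
        }

    def all_vectors(self):
        """Return every stored vector, keyed by unit id, for recall processing."""
        return {uid: entry["vector"] for uid, entry in self._entries.items()}

    def locate(self, unit_id):
        """Map a recalled unit back to its source file and page."""
        entry = self._entries[unit_id]
        return entry["file"], entry["page"]
```

Storing the origin next to each vector is what would let a recalled unit be traced back to the html page shown to the user.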

S2200, vectorizing the key information of the full-text search to obtain the vectorized key information.

Specifically, the computer device may perform vectorization processing on the key information of the full-text search by using a vectorization algorithm. Optionally, the key information of the full-text search may be keywords and/or key phrases in the content to be searched in the unstructured file.

Before the step of performing vectorization processing on the semi-structured information and the key information of the full-text search in S2000 to obtain a vectorized result, the method may further include the following steps: receiving a full-text search instruction; responding to the full text search instruction; wherein the full-text search instruction comprises key information of full-text search.

It should be noted that when a user enters the key information of the full-text search into the full-text search engine, the computer device receives the full-text search instruction and can then respond to it.

In the unstructured data full text search method, vectorization processing is performed on the analyzed semi-structured information and the key information of full text search to obtain vectorized results, and then recall processing is performed according to the vectorized results to determine target full text search results; the method can carry out vectorization processing on the analyzed semi-structured information and the key information of full-text search, and can more conveniently obtain a target full-text search result, thereby improving the search efficiency of unstructured data.

As an embodiment, as shown in fig. 3, the step of performing recall processing according to the vectorization result and determining a target full-text search result in S3000 may be implemented by the following steps:

S3100, carrying out similarity processing on the vectorization unit information and the vectorization key information to obtain an initial full-text search result.

Specifically, the computer device may calculate a similarity between the vectorization unit information and the vectorization key information to obtain an initial full-text search result. Optionally, the algorithm for calculating the similarity may be a distance algorithm or a coefficient algorithm; the distance algorithm may be Euclidean distance, Mahalanobis distance, Manhattan distance, Minkowski distance, or Hamming distance; the coefficient algorithm may be cosine similarity, the Pearson correlation coefficient, the Jaccard similarity coefficient, the Tanimoto coefficient, etc. Optionally, the obtained initial full-text search result may include the vectorization unit information, the vectorization key information, and the similarity between the vectorization unit information and the vectorization key information.
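
By way of illustration only, a few of the similarity measures listed above might be sketched as follows. The function names are assumptions; the vector-based functions expect two equal-length numeric sequences, and the Jaccard coefficient is shown on token sets.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences; smaller means more similar."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two token sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Note that distance measures and coefficient measures run in opposite directions: a distance is sorted ascending to find the best match, a coefficient descending.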

S3200, sequencing the initial full-text search results, and determining target full-text search results.

Specifically, the computer device may sort the initial full-text search results by similarity, and determine as the target full-text search result the text content in the unstructured file, together with the corresponding semi-structured information, that corresponds to the vectorization unit information with the highest similarity in the initial full-text search results.
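
By way of illustration only, sorting by similarity might be sketched as follows. The dictionary layout of each initial result and the `top_k` cut-off are assumptions; the application only states that the results with large similarity are selected.

```python
def rank_results(initial_results, top_k=3):
    """Sort initial results by descending similarity and keep the best top_k.

    Each result is assumed to be a dict with at least a "similarity" key.
    """
    ranked = sorted(initial_results, key=lambda r: r["similarity"], reverse=True)
    return ranked[:top_k]
```

With `top_k=1` the function returns only the single most similar unit, which corresponds to treating the best match as the target full-text search result.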

The step of performing ranking processing on the initial full-text search result and determining the target full-text search result in S3200 may specifically include: and sequencing the initial full-text search result according to target search information to obtain the target full-text search result.

In this embodiment, the computer device may perform ranking processing on the initial full-text search result according to the target search information, so as to obtain a target full-text search result; at this point, the initial full-text search result may be ranked using a ranking model. When training the ranking model, target search information can be introduced for training. Optionally, the ranking model may be a learning2ranking model. Optionally, the target search information may include frequently used user search information and historical search information.

For example, if the key information of the full-text search is "patient management", then "patient management" is entered into the full-text search engine (that is, a full-text search instruction is input). The computer display interface showing the response to the full-text search instruction may be as shown in fig. 4, which lists the names and contents of the unstructured files related to patient management. The target full-text search result sought by the user (that is, one of the file names and contents shown in fig. 4) is then determined from the displayed content. Specifically, fig. 5 shows the text content in html format displayed for the patient management interface file.

With the above unstructured data full-text search method, there is no need to open the unstructured file, to convert it into a structured file, or to carry out the complex process of first obtaining the index information corresponding to the desired target text information and then extracting that text from the unstructured file according to the index information. Instead, the unstructured file is parsed into a semi-structured file and the target result is obtained directly through intelligent search, which reduces the operational complexity of the method and improves the search efficiency for unstructured data.
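
By way of illustration only, the influence of target search information on the ranking might be sketched with a simple additive boost. This is a hand-written heuristic standing in for a trained ranking model, not a learning-to-rank implementation; the function name, the boost weights, and the result layout are all assumptions.

```python
def rerank(initial_results, popular_terms, history_terms,
           popular_weight=0.1, history_weight=0.2):
    """Re-order results by similarity plus a boost for matching search terms.

    popular_terms: set of terms that users search for frequently.
    history_terms: set of terms from this user's search history.
    """
    def score(result):
        tokens = set(result["text"].lower().split())
        boost = (popular_weight * len(tokens & popular_terms)
                 + history_weight * len(tokens & history_terms))
        return result["similarity"] + boost

    return sorted(initial_results, key=score, reverse=True)
```

A trained model would learn such weights from click or search logs rather than fixing them by hand.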

It should be understood that although the various steps in the flow charts of fig. 1-3 are shown in order as indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in fig. 1-3 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Fig. 6 is a schematic structural diagram of an unstructured data full-text search system according to an embodiment. As shown in fig. 6, the system may include: a parsing module 11, a vectorization module 12 and a recall module 13.

Specifically, the parsing module 11 is configured to parse an unstructured file into semi-structured information;

the vectorization module 12 is configured to perform vectorization processing on the semi-structured information and the key information of the full-text search to obtain a vectorization result;

the recall module 13 is configured to perform recall processing according to the vectorization result, and determine a target full-text search result.

The unstructured data full text search system provided by this embodiment may implement the above method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.
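
By way of illustration only, the three modules of fig. 6 might be wired together as follows. The class and method names are assumptions, the parsing module merely wraps already-extracted text as html, and a plain term-count vector with cosine similarity is substituted for whichever vectorization and similarity algorithms an actual system would use.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Helper that keeps only the text nodes of an html string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


class ParsingModule:
    """Module 11 (sketch): wraps a title and body text as html."""

    def parse(self, title, text):
        return f"<h2>{title}</h2><p>{text}</p>"


class VectorizationModule:
    """Module 12 (sketch): turns html or a query string into term counts."""

    def vectorize(self, html_or_text):
        parser = _TextOnly()
        parser.feed(html_or_text)
        parser.close()
        words = re.findall(r"\w+", " ".join(parser.chunks).lower())
        return Counter(words)


class RecallModule:
    """Module 13 (sketch): cosine similarity over sparse vectors, best first."""

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def recall(self, query_vec, unit_vecs):
        scored = [(name, self._cosine(query_vec, vec))
                  for name, vec in unit_vecs.items()]
        return sorted(scored, key=lambda item: item[1], reverse=True)


# Usage: index two hypothetical files, then search for "patient management".
parsing, vectorization, recall = ParsingModule(), VectorizationModule(), RecallModule()
files = {
    "a.pdf": ("Patient management", "register and edit patient records"),
    "b.pdf": ("Scan protocol", "configure scan parameters"),
}
unit_vecs = {name: vectorization.vectorize(parsing.parse(title, body))
             for name, (title, body) in files.items()}
hits = recall.recall(vectorization.vectorize("patient management"), unit_vecs)
```

Here `hits` is a list of (file name, similarity) pairs ordered from most to least similar, so the first entry plays the role of the target full-text search result.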

In one embodiment, the vectorization result includes vectorization unit information and vectorization key information; the vectorization module 12 comprises: a first vectorization unit and a second vectorization unit.

Specifically, the first vectorization unit is configured to perform vectorization processing on the semi-structured information to obtain vectorization unit information;

the second vectorization unit is configured to perform vectorization processing on the key information of the full-text search to obtain the vectorized key information.

In one embodiment, the first vectorization unit includes: an extraction subunit and a vectorization subunit.

The extraction subunit is configured to extract different unit information in the semi-structured information;

and the vectorization subunit is used for carrying out vectorization processing according to different unit information to obtain the vectorization unit information.

In one embodiment, the vectorization subunit is specifically configured to perform preprocessing on different unit information to obtain preprocessed different unit information, and perform vectorization on the preprocessed different unit information to obtain the vectorized unit information.

In one embodiment, the unstructured data full text search system further comprises: a search instruction receiving module and a search instruction vector module.

Specifically, the search instruction receiving module is configured to receive a full-text search instruction, where the full-text search instruction carries key information of the full-text search;

and the searching instruction vector module is used for responding to the full text searching instruction.

In one embodiment, the recall module 13 includes: a similarity processing unit and a sorting unit.

Specifically, the similarity processing unit is configured to perform similarity processing on the vectorization unit information and the vectorization key information to obtain an initial full-text search result;

and the sequencing unit is used for sequencing the initial full-text search result and determining a target full-text search result.

In one embodiment, the sorting unit is specifically configured to perform sorting processing on the initial full-text search result according to target search information, and obtain the target full-text search result.

In one embodiment, the unstructured data full text search system further comprises: a storage module.

The storage module is used for storing the vectorization unit information to a full-text search engine.

For specific limitations of the unstructured data full-text search system, reference may be made to the above limitations on the unstructured data full-text search method, which are not described herein again. The various modules in the above-described unstructured data full text search system may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the terminal, and can also be stored in software form in a memory in the computer equipment, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement an unstructured data full-text search method.

Those skilled in the art will appreciate that the configuration shown in fig. 7 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation on the terminal to which the present application is applied, and that a particular terminal may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

analyzing the unstructured file into semi-structured information;

In one embodiment, a storage medium is provided having a computer program stored thereon, the computer program when executed by a processor implementing the steps of:

analyzing the unstructured file into semi-structured information;

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for full-text searching of unstructured data, the method comprising:

analyzing the unstructured file into semi-structured information;

2. The method of claim 1, wherein the vector result comprises vectorization unit information and vectorization key information;

3. The method according to claim 2, wherein the vectorizing the semi-structured information to obtain vectorized unit information comprises:

extracting different unit information in the semi-structured information;

and vectorizing different unit information to obtain the vectorized unit information.

4. The method according to claim 3, wherein the performing vectorization processing according to different unit information to obtain the vectorized unit information comprises:

5. The method according to any one of claims 2-4, further comprising:

receiving a full-text search instruction, wherein the full-text search instruction comprises key information of full-text search;

responding to the full text search instruction.

6. The method of claim 1, wherein the recalling from the vectorized result to determine a target full-text search result comprises:

7. The method of claim 6, wherein said ranking said initial full-text search results and determining target full-text search results comprises:

8. The method of claim 5, further comprising: storing the vectorization unit information to a full-text search engine.

9. An unstructured data full text search system, the system comprising:

10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.