CN110674087A

CN110674087A - File query method and device and computer readable storage medium

Info

Publication number: CN110674087A
Application number: CN201910829794.5A
Authority: CN
Inventors: 钱克功; 沈网中
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-09-03
Filing date: 2019-09-03
Publication date: 2020-01-10
Also published as: WO2021043088A1

Abstract

The invention relates to artificial intelligence technology, and discloses a file query method, which comprises the following steps: acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set, after the service description is created, into cloud storage; extracting keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors; receiving query content input by a user, and calculating the similarity between the query content and the word vectors; and selecting the corresponding service description according to the similarity, querying the collection files from the cloud storage in a multi-strategy retrieval mode, and returning the query result to the user. The invention also provides a file query device and a computer readable storage medium. The invention enables accurate file query.

Description

File query method and device and computer readable storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a file query method and device and a computer readable storage medium.

Background

As the amount of information has grown explosively with the development of technology, more and more files need to be stored on the user's computer. The file system of the computer is responsible for creating files for users and for controlling access to the files by storing, reading, modifying and dumping them. When the user no longer uses a file, the file can be cancelled, deleted and the like, so that the file system of the computer can support the storage of massive numbers of files. However, when faced with massive numbers of files, users need a certain amount of time and energy to retrieve target files, and no related technology or product capable of quickly querying files exists in the industry at present.

Disclosure of Invention

The invention provides a file query method, a file query device and a computer readable storage medium, and mainly aims to present an accurate file query result to a user when the user queries a file by text.

In order to achieve the above object, the present invention provides a file query method, including:

acquiring a collection file set of a client, creating service description of the collection file set in a file system, and storing the collection file set after the service description is created into a cloud storage;

extracting keywords from the service description through a keyword extraction algorithm to obtain keywords of the service description, converting the keywords into word vectors, and storing the word vectors;

receiving query content input by a user, and calculating the similarity between the query content and the word vector;

and selecting corresponding service description according to the similarity, inquiring the collected files from the cloud storage in a multi-strategy retrieval mode, and returning an inquiry result to the user.

Optionally, the obtaining the collection file set of the client includes:

traversing and retrieving from a local disk of the client to obtain the collection file set; or

searching for the collection file set from a search engine by using keywords according to the requirements of the user.

Optionally, the extracting keywords from the service description by using a keyword extraction algorithm includes:

performing word segmentation operation on the service description;

calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:

wherein $\mathrm{Dep}(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $\mathrm{len}(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and $b$ is a hyper-parameter;

calculating the gravity of the words $W_i$ and $W_j$:

wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $\mathrm{tfidf}(W_i)$ represents the TF-IDF value of the word $W_i$, $\mathrm{tfidf}(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;

obtaining the association strength between the words $W_i$ and $W_j$ according to the calculated dependency relevance and gravity:

$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$

calculating the importance score of the word $W_i$ in combination with the association strength:

wherein the set appearing in the formula is the set of vertices related to the vertex $W_i$, and $\eta$ is a damping coefficient;

selecting, according to the importance scores of the words, the t words with the highest scores as the keywords of the service description.

Optionally, the calculation formula of the similarity between the query content and the word vector is as follows:

$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

wherein X represents the word vector and Y represents the query content.

Optionally, the querying of the collection files from the cloud storage in a multi-strategy retrieval mode includes:

denoting an original character string in the query content input by the user as m, and a target character string of the service description of a collection file as n;

recording the number of edits L, counted as the deletion, insertion and replacement operations required to convert the original character string m into the target character string n;

and selecting the collection file with the minimum L value as the query result, and returning the query result to the user.

In addition, in order to achieve the above object, the present invention further provides a file query apparatus, which includes a memory and a processor, wherein the memory stores a file query program operable on the processor, and the file query program, when executed by the processor, implements the following steps:

Optionally, the obtaining the collection file set of the client includes:

performing word segmentation operation on the service description;

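calculating the gravity of the words $W_i$ and $W_j$:

wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $\mathrm{tfidf}(W_i)$ represents the TF-IDF value of the word $W_i$, $\mathrm{tfidf}(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;

obtaining the association strength between the words $W_i$ and $W_j$ according to the dependency relevance and the gravity:

$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$

calculating the importance score of the word $W_i$ in combination with the association strength, wherein the set appearing in the formula is the set of vertices related to the vertex $W_i$, and $\eta$ is a damping coefficient;

wherein, in the calculation formula of the similarity between the query content and the word vector, X represents the word vector and Y represents the query content.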

In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a file query program stored thereon, where the file query program is executable by one or more processors to implement the steps of the file query method as described above.

According to the file query method, the file query device and the computer readable storage medium, when a user queries files, the collected files are analyzed for service description based on the collected files of the client, the similarity between the query content of the files required by the user input and the analyzed service description is calculated, the files are queried in a multi-strategy retrieval mode according to the similarity, the query result is returned to the user, and the accurate file query result can be presented to the user.

Drawings

Fig. 1 is a schematic flowchart of a file query method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of an internal structure of a file query apparatus according to an embodiment of the present invention;

Fig. 3 is a block diagram illustrating a file query program in a file query apparatus according to an embodiment of the present invention.

The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The invention provides a file query method. Fig. 1 is a schematic flow chart of a file query method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.

In this embodiment, the file query method includes:

S1, acquiring a collection file set of the client, creating a service description of the collection file set in the file system, and storing the collection file set, after the service description is created, into cloud storage.

In the preferred embodiment of the present invention, the client, also called the user side, refers to a program that corresponds to the server and provides local services to the user. The collection file set of the client is obtained in either of the following two ways: in the first way, the local disk of the client is traversed and searched to obtain the collection file set; in the second way, the collection file set is searched from a search engine by using keywords according to the requirements of the user.
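The first way of obtaining the collection file set (traversing the local disk) can be sketched as follows. This is only an illustrative sketch: the function name `collect_files` and the idea of selecting files by extension are assumptions introduced here, since the text does not specify how a file is recognised as belonging to the collection.

```python
import os
from typing import Iterable, List


def collect_files(root: str,
                  extensions: Iterable[str] = (".txt", ".pdf", ".docx")) -> List[str]:
    """Walk `root` recursively and return matching file paths in sorted order.

    The extension filter is a hypothetical stand-in for whatever rule
    decides that a file belongs to the collection file set.
    """
    wanted = {ext.lower() for ext in extensions}
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare extensions case-insensitively, e.g. ".PDF" matches ".pdf".
            if os.path.splitext(name)[1].lower() in wanted:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling `collect_files("/home/user/documents")` (a hypothetical path) would return every matching file below that directory.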

The cloud storage refers to a mode of online storage in which data is stored on a plurality of virtual servers, usually hosted by third parties, rather than on dedicated servers.

Preferably, the file system in the present invention is the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and can be deployed on low-cost hardware. It also relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed in a streaming mode, which provides high-throughput access to application data and suits applications with large data sets.

In detail, HDFS is composed of one NameNode (master node) and n DataNodes (slave nodes), where the NameNode is the master server that manages the file namespace and client access to files, and the DataNodes are responsible for managing file storage. In the preferred embodiment of the present invention, the service description of the collection file set is created in the master node of the HDFS file system.

Further, the service description refers to a brief summary of the content of the collection file set, and may also be represented as a name of the collection file set.

S2, extracting keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors.

In a preferred embodiment of the present invention, the extracting keywords from the service description by using a keyword extraction algorithm includes:

performing word segmentation operation on the service description;

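calculating the gravity of the words $W_i$ and $W_j$;

obtaining the association strength between the words $W_i$ and $W_j$ according to the dependency relevance and the gravity:

$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$

calculating the importance score of the word $W_i$ in combination with the association strength, wherein the set appearing in the formula is the set of vertices related to the vertex $W_i$, and $\eta$ is a damping coefficient.

Preferably, the invention selects the t words with the highest scores as the keywords of the service description according to the importance scores of the words.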
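The keyword scoring described in this document can be illustrated with the following sketch. Only the product weight = Dep * f_grav comes from the text. The closed forms used for the dependency relevance (1 / b^len), for the gravity (tfidf_i * tfidf_j / d^2) and for the damped, TextRank-style score update are assumptions made for illustration, and every function and parameter name is hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def dep(path_len: int, b: float = 2.0) -> float:
    # ASSUMED form: relevance decays exponentially with the dependency path length.
    return 1.0 / (b ** path_len)


def gravity(tfidf_i: float, tfidf_j: float, d: float) -> float:
    # ASSUMED form: TF-IDF values act as masses, d is the Euclidean distance (d > 0).
    return tfidf_i * tfidf_j / (d * d)


def rank_keywords(words: List[str],
                  tfidf: Dict[str, float],
                  path_len: Dict[FrozenSet[str], int],
                  dist: Dict[FrozenSet[str], float],
                  t: int = 2, b: float = 2.0,
                  eta: float = 0.85, iters: int = 50) -> List[str]:
    """Return the t highest-scoring words; `words` must be distinct."""
    # Association strength as stated in the text: weight = Dep * f_grav.
    weight = {}
    for wi, wj in combinations(words, 2):
        key = frozenset((wi, wj))
        weight[key] = dep(path_len[key], b) * gravity(tfidf[wi], tfidf[wj], dist[key])

    # Total association strength attached to each word, used for normalisation.
    total = {w: sum(weight[frozenset((w, k))] for k in words if k != w) for w in words}

    # ASSUMED update rule: damped iteration with damping coefficient eta.
    score = {w: 1.0 for w in words}
    for _ in range(iters):
        new_score = {}
        for wi in words:
            acc = 0.0
            for wj in words:
                if wj != wi and total[wj] > 0:
                    acc += weight[frozenset((wi, wj))] / total[wj] * score[wj]
            new_score[wi] = (1.0 - eta) + eta * acc
        score = new_score

    return sorted(words, key=lambda w: score[w], reverse=True)[:t]
```

Word pairs are keyed by `frozenset` so that the symmetric quantities (path length, distance, weight) are stored once per pair.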

Further, the invention uses a one-hot representation algorithm to convert the keywords into word vectors. One-hot representation is a basic method for representing words as vectors and is similar in concept to the bag-of-words model: a dictionary is constructed by extracting all words in a corpus, and each word in the dictionary is represented by a word vector whose dimension equals the dictionary size, in which the value of the dimension corresponding to the current word is 1 and the values of all other dimensions are 0. Accordingly, the dimension corresponding to a keyword of the service description is set to 1 and the dimensions of the other words are set to 0, so that the keyword is converted into a word vector.
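A minimal sketch of the one-hot representation follows. The helper names are hypothetical, and `multi_hot`, which merges several keywords into a single vector, is an assumption added for illustration rather than something stated in the text.

```python
from typing import Dict, Iterable, List


def build_dictionary(corpus_words: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct word an index in order of first appearance."""
    index: Dict[str, int] = {}
    for word in corpus_words:
        if word not in index:
            index[word] = len(index)
    return index


def one_hot(word: str, dictionary: Dict[str, int]) -> List[int]:
    """Vector of dictionary size with 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(dictionary)
    vec[dictionary[word]] = 1
    return vec


def multi_hot(keywords: Iterable[str], dictionary: Dict[str, int]) -> List[int]:
    """ASSUMPTION: combine several keywords by setting each of their dimensions to 1."""
    vec = [0] * len(dictionary)
    for word in keywords:
        vec[dictionary[word]] = 1
    return vec
```

For a dictionary built from the words loan, contract and invoice, the keyword contract becomes the vector [0, 1, 0].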

S3, receiving query contents input by a user, and calculating the similarity between the query contents and the word vectors.

In the preferred embodiment of the present invention, the similarity between the query content and the word vector is calculated by using the cosine similarity method. Cosine similarity measures the difference between two individuals by the cosine of the angle between two vectors in a vector space: the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, that is, the more similar the two vectors are. The calculation formula of the cosine similarity is as follows:

$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

wherein X represents the word vector and Y represents the query content. The cosine value ranges from -1 to 1. A cosine value of -1 indicates that the query content and the word vector point in exactly opposite directions, so that their similarity is 0. A cosine value of 1 indicates that they point in exactly the same direction, so that their similarity is 100%. A cosine value of 0 indicates that the query content and the word vector are independent, so that their similarity, or difference, is moderate. The similarity between the query content and the word vector is obtained from the cosine value.
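The cosine similarity described above can be computed as in the following sketch. Returning 0.0 when either vector is all zeros is an assumption made here to avoid division by zero; the text does not cover that case.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # ASSUMPTION: a zero vector is treated as unrelated to everything.
        return 0.0
    return dot / (norm_x * norm_y)
```

With one-hot style vectors every component is 0 or 1, so the value computed this way lies between 0 and 1.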

S4, selecting the corresponding service description according to the similarity, querying the collection files from the cloud storage in a multi-strategy retrieval mode, and returning the query result to the user.

In the preferred embodiment of the present invention, the multi-strategy retrieval mode includes the Levenshtein Distance (LD). When the query content input by the user is matched, the query content is first compared with the service descriptions of the collection files in the cloud storage. If a match is found, the matching collection file is returned directly to the user; if not, the similarity between the query content input by the user and the keywords of the service description of each collection file is calculated by the similarity calculation method, with a preset threshold of 0.8, and the collection files whose service descriptions yield a similarity greater than the preset threshold are returned to the user as the query result.
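The first two retrieval strategies can be sketched as a simple cascade. The function name `retrieve`, the mapping from file names to service descriptions, and the use of exact string equality as the direct match are assumptions for illustration; the similarity function is passed in as a parameter so that any scoring method can be plugged in.

```python
from typing import Callable, Dict, List


def retrieve(query: str,
             descriptions: Dict[str, str],
             similarity: Callable[[str, str], float],
             threshold: float = 0.8) -> List[str]:
    """Return matching file names, or an empty list if no strategy succeeds.

    `descriptions` maps a file name to its service description (hypothetical layout).
    """
    # Strategy 1 (ASSUMED to mean exact equality): direct match on the description.
    exact = [name for name, desc in descriptions.items() if desc == query]
    if exact:
        return exact

    # Strategy 2: keep files whose similarity exceeds the preset threshold,
    # ordered from most to least similar.
    scored = [(similarity(query, desc), name) for name, desc in descriptions.items()]
    hits = [(s, name) for s, name in scored if s > threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in hits]
```

An empty result signals that no similarity exceeded the threshold, which is the case in which the Levenshtein Distance is used.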

Further, when none of the similarity results is greater than the preset threshold, the similarity between the query content input by the user and the character string of the service description of each collection file is calculated through the LD. In detail, the original character string in the query content input by the user is denoted m, and the target character string of the service description of a collection file is denoted n. The number of edits L, counted as the deletion, insertion and replacement operations required to convert the original character string m into the target character string n, is recorded, and L for the two character strings m and n is written as $\mathrm{lev}_{m,n}(|m|, |n|)$, where |m| and |n| are the lengths of the character strings m and n respectively. The larger L is, the lower the similarity of the character strings; the collection file with the minimum L value is therefore selected as the query result and returned to the user.
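The edit count L can be computed with the standard dynamic-programming form of the Levenshtein distance, as in the sketch below. The helper `closest_file` and the mapping from file names to service descriptions are hypothetical names introduced for illustration.

```python
from typing import Dict


def levenshtein(m: str, n: str) -> int:
    """Minimum number of deletions, insertions and replacements turning m into n."""
    # prev[j] holds the distance between the processed prefix of m and n[:j].
    prev = list(range(len(n) + 1))
    for i, char_m in enumerate(m, 1):
        curr = [i]
        for j, char_n in enumerate(n, 1):
            cost = 0 if char_m == char_n else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def closest_file(query: str, descriptions: Dict[str, str]) -> str:
    """Return the file whose service description has the smallest L to the query.

    Raises ValueError if `descriptions` is empty.
    """
    return min(descriptions, key=lambda name: levenshtein(query, descriptions[name]))
```

Only two rows of the table are kept at a time, so memory use grows with the length of n rather than with the product of both lengths.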

The invention also provides a file inquiry device. Fig. 2 is a schematic diagram illustrating an internal structure of a file query apparatus according to an embodiment of the present invention.

In this embodiment, the file inquiry apparatus 1 may be a PC (Personal Computer), a terminal device such as a smart phone, a tablet Computer, or a mobile Computer, or may be a server. The file querying device 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.

The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the file querying device 1, such as a hard disk of the file querying device 1. The memory 11 may also be an external storage device of the file query apparatus 1 in other embodiments, such as a plug-in hard disk provided on the file query apparatus 1, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the memory 11 may also include both an internal storage unit and an external storage device of the file inquiry apparatus 1. The memory 11 may be used not only to store application software installed in the file search apparatus 1 and various types of data, such as a code of the file search program 01, but also to temporarily store data that has been output or is to be output.

The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip in some embodiments, and is used for executing program codes stored in the memory 11 or Processing data, such as executing the file query program 01.

The communication bus 13 is used to realize connection communication between these components.

The network interface 14 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), typically used to establish a communication link between the apparatus 1 and other electronic devices.

Optionally, the apparatus 1 may further comprise a user interface, which may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the document querying device 1 and for displaying a visualized user interface.

Fig. 2 shows only the file query apparatus 1 with the components 11 to 14 and the file query program 01. Those skilled in the art will understand that the structure shown in Fig. 2 does not limit the file query apparatus 1, which may comprise fewer or more components than those shown, combine some components, or arrange the components differently.

In the embodiment of the apparatus 1 shown in fig. 2, a file query program 01 is stored in the memory 11; the following steps are implemented when the processor 12 executes the file query program 01 stored in the memory 11:

Step one, acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set, after the service description is created, into cloud storage.

Step two, extracting keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors.

The extracting includes: performing word segmentation operation on the service description; calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description; calculating the gravity of the words $W_i$ and $W_j$; and obtaining the association strength between the words $W_i$ and $W_j$ according to the dependency relevance and the gravity:

$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$

Step three, receiving query content input by a user, and calculating the similarity between the query content and the word vectors.

Step four, selecting the corresponding service description according to the similarity, querying the collection files from the cloud storage in a multi-strategy retrieval mode, and returning the query result to the user.

Alternatively, in other embodiments, the file query program may be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.

For example, Fig. 3 is a schematic diagram of the program modules of the file query program in an embodiment of the file query apparatus of the present invention. In this embodiment, the file query program may be divided into a service description creation module 10, a keyword extraction module 20, a similarity calculation module 30, and a query module 40. Exemplarily:

The service description creation module 10 is configured to: acquire a collection file set of a client, create a service description of the collection file set in a file system, and store the collection file set, after the service description is created, into cloud storage.

The keyword extraction module 20 is configured to: extract keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, convert the keywords into word vectors, and store the word vectors.

The similarity calculation module 30 is configured to: receive query content input by a user, and calculate the similarity between the query content and the word vectors.

The query module 40 is configured to: select the corresponding service description according to the similarity, query the collection files from the cloud storage in a multi-strategy retrieval mode, and return the query result to the user.

The functions or operation steps implemented when the program modules such as the service description creation module 10, the keyword extraction module 20, the similarity calculation module 30, and the query module 40 are executed are substantially the same as those in the above embodiments, and are not described herein again.

Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a file query program is stored on the computer-readable storage medium, where the file query program is executable by one or more processors to implement the following operations:

The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the file querying device and method, and will not be described herein again.

It should be noted that the above-mentioned numbers of the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A file query method, the method comprising:

2. The file query method of claim 1, wherein the obtaining of the collection file set of the client comprises:

3. The file query method of claim 1, wherein the extracting keywords from the service description by the keyword extraction algorithm comprises:

performing word segmentation operation on the service description;

calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:

wherein $\mathrm{Dep}(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $\mathrm{len}(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and $b$ is a hyper-parameter;

calculating the gravity of the words $W_i$ and $W_j$:

wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $\mathrm{tfidf}(W_i)$ represents the TF-IDF value of the word $W_i$, $\mathrm{tfidf}(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;

obtaining the association strength between the words $W_i$ and $W_j$ according to the dependency relevance and the gravity:

$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$

calculating the importance score of the word $W_i$ in combination with the association strength:

wherein the set appearing in the formula is the set of vertices related to the vertex $W_i$, and $\eta$ is a damping coefficient;

4. The file query method of claim 1, wherein the similarity between the query content and the word vector is calculated by the formula:

$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

wherein X represents the word vector and Y represents the query content.

5. The file query method according to any one of claims 1 to 4, wherein the querying of the collection files from the cloud storage in a multi-strategy retrieval mode comprises:

6. A file query apparatus, comprising a memory and a processor, wherein the memory stores a file query program operable on the processor, and the file query program, when executed by the processor, implements the steps of:

7. The file query apparatus of claim 6, wherein the obtaining of the collection file set of the client comprises:

8. The file query apparatus of claim 6, wherein the extracting keywords from the service description by the keyword extraction algorithm comprises:

performing word segmentation operation on the service description;

calculating the gravity of the words $W_i$ and $W_j$;

obtaining the association strength between the words $W_i$ and $W_j$ according to the dependency relevance and the gravity:

$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$

calculating the importance score of the word $W_i$ in combination with the association strength, wherein the set appearing in the formula is the set of vertices related to the vertex $W_i$, and $\eta$ is a damping coefficient;

9. The file query apparatus of claim 6, wherein the similarity between the query content and the word vector is calculated by the formula:

$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

wherein X represents the word vector and Y represents the query content.

10. A computer-readable storage medium having stored thereon a file query program executable by one or more processors to perform the steps of the file query method as claimed in any one of claims 1 to 5.