CN115905297B

CN115905297B - Method, apparatus and medium for retrieving data

Info

Publication number: CN115905297B
Application number: CN202310006697.2A
Authority: CN
Inventors: 梁展钊
Original assignee: Shanghai Maice Data Technology Co ltd; Maice Shanghai Intelligent Technology Co ltd
Current assignee: Shanghai Maice Data Technology Co ltd; Maice Shanghai Intelligent Technology Co ltd
Priority date: 2023-01-04
Filing date: 2023-01-04
Publication date: 2023-12-15
Anticipated expiration: 2043-01-04
Also published as: CN115905297A

Abstract

Embodiments of the present disclosure relate to a method, apparatus, and medium for retrieving data, including: establishing a first database of characters mapped to pinyin and a second database of pinyin mapped to syllables; performing preprocessing on acquired first data, thereby acquiring an array including one or more standardized elements; traversing each element in the array based on the first database, thereby obtaining second data and saving the second data to a third database for full-text retrieval; performing word segmentation on data to be retrieved, thereby obtaining a word-segmented object and expressing the obtained object as a syllable expression based on a regular expression; retrieving the syllable expression in the second database, thereby obtaining an index of the syllables to which the pinyin of the data to be retrieved is mapped; and retrieving the third database based on the obtained index, thereby obtaining a retrieval result regarding the data to be retrieved.

Description

Method, apparatus and medium for retrieving data

Technical Field

Embodiments of the present disclosure relate generally to the field of data processing and, more particularly, relate to a method, computing device, and computer-readable storage medium for retrieving data.

Background

Chinese data collected in a data system is typically stored in the database directly as Chinese characters, e.g., in UTF-8 encoding. However, when such data is queried, the user typically does not enter the exact Chinese characters.

For example, where the database stores user names, "zhang san", "zhang s", and "zs" may all stand for Zhang San, and "wchy", "wzhy", and "wangchongyang" may all stand for Wang Chongyang; any of these may be typed by the user at query time. Moreover, because of polyphonic characters, the user does not necessarily know the correct pronunciation of the Chinese data.

Some existing database retrieval schemes are implemented on the basis of word segmentation, with the corresponding pinyin of the data supported through a custom lexicon. However, this requires intrusive modification of the database. Even when it is done through plug-ins, a separate plug-in module must be installed for each database, which raises potential compatibility problems. In addition, the lexicon can grow without bound: the larger it is, the more accurate the results, but resource usage (memory and the like) and query speed are affected accordingly.

In summary, conventional methods for retrieving data have the following disadvantage: existing pinyin search schemes return incomplete or redundant search results.

Disclosure of Invention

In view of the foregoing, the present disclosure provides a method, computing device, and computer-readable storage medium for retrieving data, which enable a user's pinyin input (full pinyin, abbreviated (simple) pinyin, or mixed input) to quickly and accurately query a database for the corresponding Chinese text.

According to a first aspect of the present disclosure, there is provided a method for retrieving data, comprising: establishing a first database of characters mapped to pinyin and a second database of pinyin mapped to syllables; performing preprocessing on acquired first data, thereby acquiring an array including one or more standardized elements; traversing each element in the array based on the first database, thereby obtaining second data and saving the second data to a third database for full-text retrieval; performing word segmentation on data to be retrieved, thereby obtaining a word-segmented object and expressing the obtained object as a syllable expression based on a regular expression; retrieving the syllable expression in the second database, thereby obtaining an index of the syllables to which the pinyin of the data to be retrieved is mapped; and retrieving the third database based on the obtained index, thereby obtaining a retrieval result regarding the data to be retrieved.

According to a second aspect of the present disclosure, there is provided a computing device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect of the present disclosure.

In a third aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect of the present disclosure.

In some embodiments, the method further comprises: calculating, for the pinyin to which characters in the first database are mapped, the number of pinyin characters and the character distance; marking similar pinyin of each pinyin according to the calculated number of pinyin characters and the calculated character distance; and mapping each pinyin to its corresponding similar pinyin, thereby obtaining a mapped similar pinyin index, wherein the similar pinyin index is saved to the third database.
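The similar-pinyin marking described above could be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes that "character distance" means Levenshtein edit distance, and the function names and the two thresholds (`max_len_diff`, `max_distance`) are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, used here as one possible 'character distance'."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def mark_similar(pinyins, max_len_diff=1, max_distance=1):
    """Map each pinyin to the set of pinyins considered similar to it.

    Both thresholds are assumptions chosen for illustration.
    """
    similar = {p: set() for p in pinyins}
    for a, b in combinations(pinyins, 2):
        close_in_length = abs(len(a) - len(b)) <= max_len_diff
        if close_in_length and edit_distance(a, b) <= max_distance:
            similar[a].add(b)
            similar[b].add(a)
    return similar


# Hypothetical sample of pinyins taken from a character-to-pinyin table.
SIMILAR = mark_similar(["chong", "zhong", "cong", "wang", "yang"])
```

With these sample values, "chong" is marked as similar to both "zhong" and "cong", while "zhong" and "cong" are not similar to each other because they are two edits apart.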

In some embodiments, mapping each pinyin to its corresponding similar pinyin further includes: in response to the pinyin being mapped to a plurality of similar pinyins corresponding thereto, setting a first probability for each of the plurality of similar pinyins; obtaining a second probability of each pinyin in the similar pinyin according to the obtained confirmation rate of the retrieval result; calculating a third probability of each of the similar pinyin based on the first probability and the second probability; and retrieving the third database based on the calculated third probability, thereby obtaining a retrieval result.

In some embodiments, calculating the third probability of each of the similar pinyin comprises: acquiring an update time period of the second probability; determining a weight value of the second probability in different update time periods based on the acquired update time period, wherein the weight value varies based on a functional relation with the update time period; and calculating a third probability for each of the similar pinyin based on the update time period, the first and second probabilities, and the changed weight value.

In some embodiments, the method further comprises: in response to the third probability of a similar pinyin being below a predetermined threshold, deleting that similar pinyin from the third database.
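The probability steps above could be sketched as follows. The text only states that the weight of the second probability varies as a function of the update time period; the exponential half-life decay, the linear blend of the two probabilities, the threshold value, and all names below are assumptions made for illustration.

```python
def weight(days_since_update: float, half_life_days: float = 30.0) -> float:
    """Weight of the feedback-based (second) probability.

    Assumed form: the weight halves every `half_life_days`, so stale
    confirmation data counts for less.
    """
    return 0.5 ** (days_since_update / half_life_days)


def third_probability(first: float, second: float,
                      days_since_update: float) -> float:
    """Blend the preset (first) and feedback-based (second) probabilities."""
    w = weight(days_since_update)
    return (1.0 - w) * first + w * second


def prune(candidates: dict, threshold: float = 0.1) -> dict:
    """Drop similar pinyins whose third probability is below the threshold."""
    return {p: v for p, v in candidates.items() if v >= threshold}


# Hypothetical candidates for one pinyin: equal first probabilities,
# second probabilities taken from how often users confirmed each result.
first = {"zhong": 0.5, "cong": 0.5}
second = {"zhong": 0.9, "cong": 0.0}
fresh = {p: third_probability(first[p], second[p], 0) for p in first}
kept = prune(fresh)
```

With freshly updated feedback the weight is 1, so the third probability equals the second, and "cong" falls below the threshold and is removed.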

In some embodiments, establishing the first database of characters mapped to pinyin comprises: defining pinyin separators; separating a plurality of similar pinyins corresponding to a character based on the defined pinyin separators; mapping the character to the plurality of similar pinyins, thereby establishing a hash index of characters mapped to pinyin; and establishing a regular expression of the characters based on the expression form of the similar pinyin.

In some embodiments, performing preprocessing on the acquired first data includes: executing word segmentation on the first data to obtain a word segmentation result of the first data; based on the established regular expression, confirming whether the word segmentation result of the first data accords with the regular expression; in response to the word segmentation result of the first data not conforming to the regular expression, replacing the word segmentation result with blank characters; and in response to the word segmentation result of the first data conforming to the regular expression, replacing the word segmentation result with a plurality of similar pinyin or a plurality of similar pinyin indexes according to the regular expression.

In some embodiments, retrieving the third database comprises: retrieving a third database that supports the ngram index, based on the ngram full-text parser in MySQL.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method for retrieving data according to an embodiment of the present disclosure.

Fig. 2 illustrates a flow chart of a method 200 for retrieving data according to an embodiment of the present disclosure.

Fig. 3 shows a block diagram of an electronic device 300 according to an embodiment of the disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

The term "comprising" and variations thereof as used herein mean open-ended inclusion, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.

Fig. 1 shows a schematic diagram of a system 100 for implementing a method for retrieving data according to an embodiment of the present disclosure. As shown in fig. 1, the system 100 includes a computing device 110, a Chinese data management device 130, and a network 130. The computing device 110 and the Chinese data management device 130 may exchange data over the network 130 (e.g., the Internet).

The Chinese data management device 130 may, for example, perform conventional management of Chinese data, such as collecting and storing Chinese data. The Chinese data management device 130 may also send the managed Chinese data to the computing device 110. The Chinese data management device 130 is, for example and without limitation, a desktop computer, laptop computer, netbook computer, tablet computer, web browser, e-book reader, personal digital assistant (PDA), wearable computer (such as a smartwatch or activity tracker device), or the like, capable of reading and modifying Chinese data. The Chinese data management device 130 may be configured to store Chinese data, send the Chinese data to the computing device 110 via the network 130, and receive the Chinese data processed by the computing device 110.

The computing device 110 is used, for example, to receive Chinese data from the Chinese data management device 130 via the network 130 and to mine the meaning of the received Chinese data. The computing device 110 may also determine an associated object, company, etc., of the Chinese data based on the mined Chinese data. The computing device 110 may have one or more processing units, including special-purpose processing units such as GPUs, FPGAs, and ASICs, as well as general-purpose processing units such as CPUs. In addition, one or more virtual machines may also be running on each computing device 110. In some embodiments, the computing device 110 and the Chinese data management device 130 may be integrated together or may be separate from each other. In some embodiments, the computing device 110 includes, for example, a database module 112, a preprocessing module 114, a traversal module 116, a word segmentation module 118, and a retrieval module 120.

A database module 112, the database module 112 being configured to build a first database of text to pinyin and a second database of pinyin to syllables.

A preprocessing module 114, the preprocessing module 114 being configured to perform preprocessing on the acquired first data, thereby acquiring an array comprising one or more standardized elements.

A traversing module 116, the traversing module 116 configured to traverse each element in the array based on the first database, thereby obtaining second data and saving the second data to a third database for full text retrieval use.

A word segmentation module 118, the word segmentation module 118 configured to perform word segmentation on data to be retrieved, thereby obtaining a segmented object and expressing the obtained object as a syllable expression based on a regular expression.

A retrieval module 120, the retrieval module 120 being configured to retrieve the syllable expression in the second database, thereby obtaining an index of pinyin mapping to syllables of the data to be retrieved. The retrieval module 120 is further configured to retrieve the third database based on the retrieved index, thereby obtaining a retrieval result.

Fig. 2 illustrates a flow chart of a method 200 for retrieving data according to an embodiment of the present disclosure. The method 200 may be performed by the computing device 110 shown in fig. 1, or at the electronic device 300 shown in fig. 3. It should be understood that method 200 may also include additional blocks not shown and/or that the blocks shown may be omitted, the scope of the disclosure being not limited in this respect.

At step 202, the computing device 110 creates a first database of text to pinyin and a second database of pinyin to syllables.

In one embodiment, for the first database, the computing device 110 defines a pinyin separator. The computing device 110 may choose "|" as the separator because the "|" symbol does not appear in pinyin, although other symbols may also be selected. Based on the defined pinyin separator, a plurality of similar pinyins corresponding to a character are separated. Similar pinyin refers to a plurality of pinyin representations corresponding to a single Chinese character, including but not limited to the readings of polyphonic characters. Common colloquial pronunciations of Chinese characters may also be included in the technical solutions of the present disclosure. For example, a single Chinese character "good" may be labeled and indexed to the three pinyins |ling|xing|yuan|.

Every pinyin is wrapped at its beginning and end with the above symbol. All pronunciations of a polyphonic character are mapped to a plurality of similar pinyins separated by the separator, thereby creating a hash index of characters mapped to pinyin. The computing device 110 may annotate the nearly 42,000 characters in actual use with the above pinyin and separators. The annotations are stored in a database table and hash-indexed. For example, "㐀" may be labeled and indexed to |qiu|, "㐁" to |tian|, and "㐄" to |kua|. Based on the expression form of the similar pinyin, a regular expression of the characters is established: all annotated characters are combined with the alphanumeric characters to build a regular expression, for example "[0-9a-zA-Z〇㐀㐁㐄]".
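A minimal sketch of such a first database, assuming an in-memory dictionary in place of a hash-indexed database table. The three rare characters and their pinyin come from the example above; the polyphonic character "重" and all variable and function names are hypothetical additions for illustration.

```python
import re

SEP = "|"  # separator chosen because it never occurs in pinyin

# Character -> list of readings. A real table would cover the full
# annotated character set; "重" is added here only to show a polyphone.
RAW_READINGS = {
    "㐀": ["qiu"],
    "㐁": ["tian"],
    "㐄": ["kua"],
    "重": ["zhong", "chong"],
}


def label(readings):
    """Wrap every reading in separators, e.g. ['zhong', 'chong'] -> '|zhong|chong|'."""
    return SEP + SEP.join(readings) + SEP


# Hash index of character -> delimited pinyin label.
FIRST_DB = {char: label(readings) for char, readings in RAW_READINGS.items()}

# Character class accepting digits, Latin letters, and every annotated character.
ALLOWED_CHAR = re.compile(
    "[0-9a-zA-Z" + "".join(re.escape(char) for char in FIRST_DB) + "]"
)
```

Because each reading is delimited on both sides, a search for `|chong|` cannot accidentally match inside a longer reading.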

For the second database, the computing device 110 may build a database of pinyin mapped to syllables. Taking full pinyin and abbreviated pinyin as an example, all full spellings, initials, finals, and tail sounds may be listed, stored in a database table, and hash-indexed. For example, "shang" indexes to a full syllable, "ang" to a tail final, "b" to an initial, and "a" to a vowel.
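The second database could be sketched as a lookup table from a pinyin fragment to its kind. The lists below are a small hypothetical sample rather than a complete pinyin inventory, the kind labels are invented for illustration, and treating "y" and "w" as initials is an assumption that suits keyboard input.

```python
# Hypothetical, deliberately small samples of each category.
FULL_SYLLABLES = ["shang", "wang", "chong", "yang"]
FINALS = ["ang", "ong", "eng", "ing"]
INITIALS = ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
            "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"]
VOWELS = ["a", "o", "e"]

# Hash index of fragment -> kind, standing in for an indexed database table.
SECOND_DB = {}
for kind, fragments in (("syllable", FULL_SYLLABLES), ("final", FINALS),
                        ("initial", INITIALS), ("vowel", VOWELS)):
    for fragment in fragments:
        SECOND_DB[fragment] = kind


def classify(fragment: str):
    """Return the kind of a pinyin fragment, or None if it is unknown."""
    return SECOND_DB.get(fragment.lower())
```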

At step 204, computing device 110 performs preprocessing on the acquired first data to acquire an array comprising one or more standardized elements.

In one embodiment, the computing device 110 may obtain first data, such as the input "我们明天去shoPPing-" ("we go shoPPing tomorrow"). The computing device 110 may first regularize the text, replacing every character that does not conform to the regular expression of the first database with a space: "我们明天去shoPPing ". The computing device 110 may then tokenize the text. That is, a sentence or phrase is divided so that each individual Chinese character is one token (word segmentation result), while consecutive digits and English letters are grouped into a single token, yielding an array of one or more standardized elements, such as [ "我", "们", "明", "天", "去", "shoPPing" ].
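The two preprocessing steps, regularization and tokenization, could be sketched as follows. The sample input is a hypothetical mixed Chinese and English string, the allowed character set is reduced to the characters in that sample, and the names are illustrative only.

```python
import re

# Hypothetical allowed set: alphanumerics plus the few characters used below.
ALLOWED_CHARS = "0-9a-zA-Z我们明天去"
DISALLOWED = re.compile(f"[^{ALLOWED_CHARS}]")

# A token is either a run of digits/letters or one single other character.
TOKEN = re.compile(r"[0-9a-zA-Z]+|[^0-9a-zA-Z\s]")


def preprocess(text: str) -> list:
    """Replace disallowed characters with spaces, then split into tokens."""
    cleaned = DISALLOWED.sub(" ", text)
    return TOKEN.findall(cleaned)


tokens = preprocess("我们明天去shoPPing-")
```

Here `tokens` holds one element per Chinese character followed by the single element "shoPPing"; the trailing hyphen is replaced by a space and produces no token.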

At step 206, computing device 110 traverses each element in the array based on the first database, thereby obtaining second data and saving the second data to a third database for full text retrieval use.

In one embodiment, the computing device 110 may convert the tokens into pinyin by querying the first database, e.g., converting [ "我", "们", "明", "天", "去", "shoPPing" ] into [ "|wo|", "|men|", "|ming|", "|tian|", "|qu|", "shoPPing" ]. The computing device 110 may then join the elements, e.g., with spaces, into text that can be indexed by the ngram full-text parser plug-in of MySQL: "|wo| |men| |ming| |tian| |qu| shoPPing", as the second data. Finally, the second data ("|wo| |men| |ming| |tian| |qu| shoPPing") is stored in the third database for full-text retrieval.
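The traversal step could be sketched as a dictionary lookup followed by a join. The small first-database sample and the function name are hypothetical; tokens without an entry (such as English words) are assumed to pass through unchanged.

```python
# Hypothetical excerpt of the first database: character -> delimited pinyin.
FIRST_DB = {
    "我": "|wo|",
    "们": "|men|",
    "明": "|ming|",
    "天": "|tian|",
    "去": "|qu|",
}


def to_second_data(tokens) -> str:
    """Replace each token by its pinyin label and join with spaces."""
    return " ".join(FIRST_DB.get(token, token) for token in tokens)


second_data = to_second_data(["我", "们", "明", "天", "去", "shoPPing"])
```

The resulting string is what would be written to the full-text-indexed column of the third database.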

At step 208, the computing device 110 performs a word segmentation on the data to be retrieved, thereby obtaining a segmented object and expressing the obtained object as a syllable expression based on a regular expression.

In one embodiment, the computing device 110 may perform word segmentation on the data to be retrieved, i.e., the query input. It may first be determined whether the query is a pinyin search; if so, processing continues. The computing device 110 may segment the pinyin into tokens based on the second database. For example, the data to be retrieved "wchy" may be segmented into "w ch y", "wangchongyang" into "wang chong yang", and "wchongy" into "w chong y".
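One way to segment a pinyin query against the second database is greedy longest-match, sketched below. The greedy strategy is an assumption (the text does not name an algorithm), and the syllable and initial sets are small hypothetical samples.

```python
# Hypothetical excerpts of the second database.
SYLLABLES = {"wang", "chong", "zhong", "yang", "zhang", "san"}
INITIALS = {"zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"}
KNOWN = SYLLABLES | INITIALS
MAX_LEN = max(len(fragment) for fragment in KNOWN)


def segment(query: str) -> list:
    """Split a pinyin query into known fragments, longest match first."""
    query = query.lower().replace(" ", "")
    pieces, pos = [], 0
    while pos < len(query):
        for size in range(min(MAX_LEN, len(query) - pos), 0, -1):
            piece = query[pos:pos + size]
            if piece in KNOWN:
                pieces.append(piece)
                pos += size
                break
        else:
            # No known fragment starts here, so this is not a pinyin query.
            raise ValueError(f"not a pinyin query: {query!r}")
    return pieces
```

Greedy matching can pick the wrong boundary for genuinely ambiguous input, so a production system would likely need backtracking or ranking of alternative segmentations.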

At step 210, computing device 110 retrieves the syllable expression in the second database, thereby obtaining an index of pinyin mappings to syllables for the data to be retrieved.

In one embodiment, the computing device 110 may perform a polyphone query based on a regular expression or syllable expression as described above. The data to be retrieved is "wchy", can be divided into "\S \ ] w\\\s\s \ w/v S/S \. The data to be retrieved is "wangchongyang", can be divided into "\S \ ] i wang\\s \s \ i wang \i\ S/S. The data to be retrieved is "wchongy", can be divided into "\s\w\s\s \" |chong\\s\\y \\s ", thereby obtaining the index of the syllable mapped by the pinyin of the data to be retrieved.
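As an illustration of how segmented fragments might be turned into a syllable expression, the sketch below builds a Python regular expression over text stored as space-separated, "|"-delimited pinyin labels. The exact pattern syntax, the handling of polyphones, and every name here are assumptions, not the expressions of the embodiment.

```python
import re

# Hypothetical set of fragments known to be complete syllables.
FULL_SYLLABLES = {"wang", "chong", "zhong", "yang"}

# Matches zero or more complete readings inside one character's label.
OTHER_READINGS = r"(?:[a-z]+\|)*"


def syllable_expression(pieces) -> str:
    """Build a regex in which each piece must match one reading of one character."""
    parts = []
    for piece in pieces:
        body = re.escape(piece)
        if piece not in FULL_SYLLABLES:
            body += "[a-z]*"  # an initial stands for any reading it begins
        parts.append(r"\|" + OTHER_READINGS + body + r"\|" + OTHER_READINGS)
    return r"\s".join(parts)


# Stored label text; the middle character is a polyphone with two readings.
stored = "|wang| |zhong|chong| |yang|"
abbreviated = re.search(syllable_expression(["w", "ch", "y"]), stored)
mixed = re.search(syllable_expression(["w", "chong", "y"]), stored)
other_reading = re.search(syllable_expression(["w", "zh", "y"]), stored)
miss = re.search(syllable_expression(["w", "ch", "y"]), "|zhang| |san|")
```

Because every reading of a polyphone sits in the same label, both "w ch y" and "w zh y" find the stored name, which is the behavior the polyphone query is meant to provide.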

In step 212, the computing device 110 retrieves the third database based on the obtained index, thereby obtaining a retrieval result regarding the data to be retrieved.

In one embodiment, the computing device 110 may execute an ngram-index-based query in the third database using the obtained index and expression, thereby obtaining a retrieval result for the data to be retrieved and returning it to the user. Databases usable in the present disclosure include various types of databases that support an ngram index, including but not limited to PostgreSQL, MySQL, SQL Server, and other databases commonly used in the art.
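To show what an ngram-index lookup does, the sketch below implements a tiny in-memory bigram inverted index. It is a stand-in for a database's ngram full-text index, not the MySQL API; the class, its methods, and the sample rows are hypothetical.

```python
from collections import defaultdict


def ngrams(text: str, n: int = 2) -> set:
    """All contiguous substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    """Inverted index from each n-gram to the ids of rows containing it."""

    def __init__(self, n: int = 2):
        self.n = n
        self.postings = defaultdict(set)
        self.rows = {}

    def add(self, row_id, text: str) -> None:
        self.rows[row_id] = text
        for gram in ngrams(text, self.n):
            self.postings[gram].add(row_id)

    def search(self, phrase: str) -> set:
        grams = ngrams(phrase, self.n)
        if not grams:  # phrase shorter than n cannot use the index
            return set()
        # Candidate rows contain every n-gram of the phrase ...
        candidates = set.intersection(
            *(self.postings.get(gram, set()) for gram in grams))
        # ... and are then verified against the stored text.
        return {row for row in candidates if phrase in self.rows[row]}


index = NgramIndex()
index.add(1, "|wang| |chong| |yang|")
index.add(2, "|zhang| |san|")
```

The n-gram intersection narrows the candidates cheaply, and the final substring check removes rows that contain the same n-grams in a different order.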

Fig. 3 illustrates a schematic block diagram of an example electronic device 300 that may be used to implement embodiments of the present disclosure. For example, computing device 110 as shown in FIG. 1 may be implemented by electronic device 300. As shown, the electronic device 300 includes a Central Processing Unit (CPU) 301 that can perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 302 or loaded from a storage unit 308 into a Random Access Memory (RAM) 303. In the random access memory 303, various programs and data required for the operation of the electronic device 300 may also be stored. The central processing unit 301, the read only memory 302 and the random access memory 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.

A number of components in the electronic device 300 are connected to the input/output interface 305, including: an input unit 306 such as a keyboard, mouse, microphone, etc.; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, an optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the device 300 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The various procedures and processing described above, such as method 200, may be performed by the central processing unit 301. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 300 via the read only memory 302 and/or the communication unit 309. One or more of the acts of the method 200 described above may be performed when the computer program is loaded into the random access memory 303 and executed by the central processing unit 301.

The present disclosure relates to methods, apparatus, systems, electronic devices, computer readable storage media, and/or computer program products. The computer program product may include computer readable program instructions for performing various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disks (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.

The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge computing devices. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.

Computer program instructions for performing the operations of the present disclosure can be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, such that the electronic circuitry can execute the computer readable program instructions.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It will be appreciated by persons skilled in the art that the present invention is not limited to the embodiments described above, but may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the illustrated examples and embodiments are to be considered as illustrative and not restrictive, and the invention is intended to cover various modifications and substitutions without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method for retrieving data, comprising:

establishing a first database of characters mapped to pinyin and a second database of pinyin mapped to syllables, wherein the first database comprises a plurality of pinyin expressions corresponding to the characters and commonly used colloquial pronunciations; pinyin separators are defined, and the characters are mapped to a plurality of similar pinyins based on the defined pinyin separators, so that a hash index of the characters mapped to pinyin is established; a regular expression of the characters is established based on the expression form of the similar pinyin; and the second database comprises all full spellings, initials, finals, and tail sounds of the pinyin;

performing preprocessing on the acquired first data, thereby acquiring an array comprising one or more standardized elements, wherein the preprocessing comprises performing word segmentation on the first data, and acquiring a word segmentation result of the first data; based on the established regular expression, confirming whether the word segmentation result of the first data accords with the regular expression; in response to the word segmentation result of the first data not conforming to the regular expression, replacing the word segmentation result with blank characters; and in response to the word segmentation result of the first data conforming to the regular expression, replacing the word segmentation result with a plurality of similar pinyin or a plurality of similar pinyin indexes according to the regular expression;

traversing each element in the array based on the first database, thereby obtaining second data and saving the second data to a third database for full text retrieval;

calculating the number of pinyin characters mapped to pinyin by characters in the first database and the character distance;

marking similar pinyin of each pinyin according to the calculated number of pinyin characters and the calculated character distance;

mapping each pinyin to its corresponding similar pinyin, thereby obtaining mapped similar pinyin indexes, wherein the similar pinyin indexes are saved to the third database;

performing word segmentation on the data to be retrieved, thereby obtaining a word-segmented object and expressing the obtained object as a syllable expression based on a regular expression;

retrieving the syllable expression in the second database, thereby obtaining the index of the syllable mapped to the pinyin of the data to be retrieved; and

retrieving, based on the obtained index, a third database supporting the ngram index by means of the ngram full-text parser in MySQL, thereby obtaining a retrieval result regarding the data to be retrieved.

2. The method of claim 1, wherein mapping each pinyin to its corresponding similar pinyin further comprises:

in response to the pinyin being mapped to a plurality of similar pinyins corresponding thereto, setting a first probability for each of the plurality of similar pinyins;

obtaining a second probability of each pinyin in the similar pinyin according to the obtained confirmation rate of the retrieval result;

calculating a third probability of each of the similar pinyin based on the first probability and the second probability; and

retrieving the third database based on the calculated third probability, thereby obtaining a retrieval result.

3. The method of claim 2, wherein calculating a third probability for each of the similar pinyins comprises:

acquiring an update time period of the second probability;

determining a weight value of the second probability in a different update time period based on the acquired update time period, wherein the weight value varies based on a functional relation with the update time period; and

calculating a third probability of each of the similar pinyins based on the update time period, the first and second probabilities, and the changed weight value.

4. The method of claim 3, further comprising:

deleting each of the similar pinyins in the third database in response to the third probability of the similar pinyin being below a predetermined threshold.

5. A computing device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor;

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.

6. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.