CN116029284B - Chinese substring extraction method, Chinese substring extraction system, storage medium and electronic equipment

Info

Publication number
CN116029284B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310301303.6A
Other languages
Chinese (zh)
Other versions
CN116029284A (en)
Inventor
张强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mido Technology Co ltd
Original Assignee
Shanghai Mdata Information Technology Co ltd
Application filed by Shanghai Mdata Information Technology Co ltd
Priority to CN202310301303.6A
Publication of CN116029284A
Application granted
Publication of CN116029284B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a Chinese substring extraction method, a Chinese substring extraction system, a storage medium and electronic equipment. The Chinese substring extraction method comprises the following steps: splitting the Chinese text into a plurality of sentences; extracting, from a target substring set, a preset number of target substrings related to each sentence; extracting a plurality of split substrings from each sentence; for each target substring, calculating its similarity with each split substring, and taking the split substring with the maximum similarity as a fuzzy substring when that maximum similarity is greater than a preset threshold; and verifying the fuzzy substring to generate an effective fuzzy substring. By setting target substrings and locating the position of each fuzzy substring in a generalized manner, the method, system, storage medium and electronic equipment achieve efficient and accurate substring extraction.

Description

Chinese substring extraction method, Chinese substring extraction system, storage medium and electronic equipment
Technical Field
The invention belongs to the technical field of character recognition, and particularly relates to a Chinese substring extraction method, a Chinese substring extraction system, a storage medium and electronic equipment.
Background
A subsequence of any number of consecutive characters in a string is called a substring of that string. In the prior art, substring extraction is generally performed in one of two ways.
(1) Exact matching.
The KMP algorithm is an improved exact string-matching algorithm. Its core idea is to use the information obtained when a match fails to reduce the number of comparisons between the pattern string and the main string, thereby achieving fast matching. However, when the substring to be found contains wrong characters, extra characters, missing characters or swapped order, existing exact matching algorithms cannot find the target substring. Exact matching can cope only by enumerating the fuzzy substrings, but the fuzzy spellings of a substring are difficult to enumerate exhaustively, so generalized extraction is not possible. For example: the target substring is "school Shi Mingli, school Shi Zeng letter, school Shi Chongde, school Shi Lihang", and the corresponding fuzzy substring may be "school time gift, school Shi Zeng letter, school Shi Chongde, school Shi Lihang"; "Xin Shi Ming Gift, xin, xue Shi Chongde, xue Shi Lihang of academic time, xue Shi Zeng"; "school time gift, school Shi Xin, school Shi Congde, school Shi Lihang"; "school time bright gift, school Shi Zeng letter, school Shi Chongde, school history line"; "academic theory, school Shi Zeng, school Shi Li lines, school Shi Chongde". In such cases the position cannot be located by exact matching.
(2) Fuzzy search.
A fuzzy search returns results according to the number of occurrences of each character of the phrase, but cannot give the specific position of the phrase. If regular expressions are used instead, a regular expression must be configured for each character of the phrase, which produces a large number of regular expressions, fails to achieve generalized retrieval, and is both inefficient and inaccurate.
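The limitation of exact matching described in (1) can be illustrated with a minimal KMP sketch. The function names `build_failure` and `kmp_find` and the sample Chinese strings are assumptions introduced for illustration only; they do not come from the patent text.

```python
def build_failure(pattern: str) -> list:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter matching prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first exact occurrence of pattern, or -1."""
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]  # reuse information from the failed match
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


# Illustrative strings (assumed): one wrong character defeats exact matching.
print(kmp_find("开展医药卫生体制改革工作", "医药卫生体制改革"))
print(kmp_find("开展医药卫生体制改草工作", "医药卫生体制改革"))
```

The second call fails even though only the last character of the phrase differs, which is the gap the fuzzy extraction method is meant to close.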
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a Chinese substring extraction method, a system, a storage medium and an electronic device that, by setting target substrings, quickly locate the position of a fuzzy substring in a generalized manner, thereby achieving efficient and accurate substring extraction.
In a first aspect, the present invention provides a Chinese substring extraction method, the method comprising the steps of: splitting the Chinese text into a plurality of sentences; extracting, from a target substring set, a preset number of target substrings related to each sentence; extracting a plurality of split substrings from each sentence; for each target substring, calculating its similarity with each split substring, and taking the split substring with the maximum similarity as a fuzzy substring when that maximum similarity is greater than a preset threshold; and verifying the fuzzy substring to generate an effective fuzzy substring.
In one implementation of the first aspect, splitting the chinese text into a plurality of sentences includes the steps of:
acquiring sentences based on periods in the Chinese text;
acquiring sentences based on the question marks in the Chinese text;
and acquiring sentences based on the exclamation mark in the Chinese text.
In one implementation of the first aspect, extracting a preset number of target substrings related to the sentence from the target substring set includes the steps of:
establishing, on a per-Chinese-character basis, a full-text index of the target substrings in the target substring set against the Chinese text;
searching for the target substrings that match the sentence based on the full-text index;
and selecting a preset number of the target substrings that match the sentence.
In one implementation of the first aspect, extracting the plurality of split substrings from each sentence includes the steps of:
performing word segmentation on the sentence to obtain a plurality of words;
and combining adjacent words to obtain the split substrings, wherein the length of each split substring lies within a preset proportional interval of the length of the target substring.
In one implementation of the first aspect, calculating the similarity to each split substring includes the steps of:
extracting the characters, pinyin and Bi-Gram arrays of the target substring and the split substring;
obtaining the number of equal characters, the number of equal pinyin and the number of equal Bi-Gram array entries between the target substring and the split substring;
calculating the punctuation weight as punctuation weight = (target substring punctuation number - source substring punctuation number) / 4.0;
calculating the character weight, pinyin weight and Bi-Gram array weight, wherein character weight = 2.0 × (number of equal characters + punctuation weight) / (target substring length + split substring length), pinyin weight = 2.0 × (number of equal pinyin + punctuation weight) / (target substring length + split substring length), and Bi-Gram array weight = 1.0 × (number of equal Bi-Gram array entries + punctuation weight) / (target substring length + split substring length);
and calculating the similarity as similarity = character weight × pinyin weight × Bi-Gram array weight.
In one implementation of the first aspect, when the head and tail characters of the target substring and the split substring are the same, the Bi-Gram array weight is updated to Bi-Gram array weight = 0.4 + 0.6 × (number of equal Bi-Gram array entries + punctuation weight × 2) / (target substring length + split substring length).
In one implementation of the first aspect, verifying the fuzzy substring includes the following steps:
when the fuzzy substring is one character shorter than the target substring and the segmented token adjacent to the fuzzy substring is a single character, appending that single character to complete the fuzzy substring;
when two target substrings correspond to the same fuzzy substring, selecting the target substring with the higher similarity to correspond to the fuzzy substring.
In a second aspect, the invention provides a Chinese substring extraction system, which comprises a splitting module, a first extraction module, a second extraction module, a calculation module and a verification module;
the splitting module is used for splitting the Chinese text into a plurality of sentences;
the first extraction module is used for extracting a preset number of target substrings related to the sentence in a target substring set;
the second extraction module is used for extracting a plurality of split substrings from each sentence;
the calculation module is used for calculating, for each target substring, its similarity with each split substring, and taking the split substring with the maximum similarity as a fuzzy substring when that maximum similarity is greater than a preset threshold;
and the verification module is used for verifying the fuzzy substring to generate an effective fuzzy substring.
In a third aspect, the present invention provides an electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is used for executing the computer program stored in the memory so as to enable the electronic equipment to execute the Chinese substring extraction method.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon a computer program, wherein the program when executed by an electronic device implements the chinese substring extraction method described above.
As described above, the Chinese substring extraction method, the Chinese substring extraction system, the storage medium and the electronic equipment have the following beneficial effects.
By locating fuzzy substrings in a generalized manner, the Chinese substring extraction method, system, storage medium and electronic equipment are fast and efficient and reduce system load; they solve the problems caused by exact matching and regular-expression matching of substrings in the prior art, and effectively improve the accuracy and speed of substring matching.
Drawings
Fig. 1 is a schematic view of an electronic device according to an embodiment of the invention.
Fig. 2 is a flowchart illustrating a Chinese substring extraction method according to an embodiment of the invention.
Fig. 3 is a schematic diagram of a Chinese substring extraction system according to an embodiment of the invention.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Description of element reference numerals
11-mobile phone
12-tablet personal computer
13-notebook computer
31-splitting module
32-first extraction module
33-second extraction module
34-calculation module
35-verification module
41-processing unit
42-memory
421-RAM
422-cache memory
423-storage system
424-program/utility
4241-program modules
43-bus
44-input/output interface
45-network adapter
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
The following embodiments of the present invention provide a Chinese substring extraction method, which may be applied to an electronic device as shown in fig. 1. The electronic device in the present invention may include a mobile phone 11, a tablet computer 12, a notebook computer 13, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA) and the like with a wireless charging function, and the specific type of the electronic device is not limited in the embodiments of the present invention.
For example, the electronic device may be a station (ST) in a wireless charging enabled WLAN, a wireless charging enabled cellular telephone, a cordless telephone, a Session Initiation Protocol (SIP) telephone, a wireless local loop (WLL) station, a personal digital assistant (PDA) device, a wireless charging enabled handheld device, a computing device or other processing device, a computer, a laptop computer, a handheld communication device, a handheld computing device, and/or other devices for communicating over a wireless system, as well as next generation communication systems, such as a mobile terminal in a 5G network, a mobile terminal in a future evolved public land mobile network (PLMN), or a mobile terminal in a future evolved non-terrestrial network (NTN), etc.
For example, the electronic device may communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol including, but not limited to, Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include the Global Positioning System (GPS), the Global Navigation Satellite System (GLONASS), the BeiDou Navigation Satellite System (BDS), the Quasi-Zenith Satellite System (QZSS) and/or Satellite Based Augmentation Systems (SBAS).
The following describes the technical solution in the embodiment of the present invention in detail with reference to the drawings in the embodiment of the present invention.
In one embodiment, as shown in fig. 2, the Chinese substring extraction method of the present invention includes the following steps.
Step S1: split the Chinese text into a plurality of sentences.
Specifically, the Chinese text is divided at the periods, question marks and exclamation marks it contains, thereby obtaining a plurality of mutually independent sentences. It should be noted that if the Chinese text is HTML text, the tags are removed before the sentences are split.
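Step S1 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `split_sentences`, the `is_html` flag, the inclusion of the ASCII marks `?` and `!`, and the crude regular-expression tag stripper are all assumptions.

```python
import re

# Crude tag stripper (assumption): removes anything of the form <...>.
TAG_RE = re.compile(r"<[^>]+>")
# A sentence is a run of non-terminator characters plus one optional
# terminator: full-width period, question mark or exclamation mark
# (the ASCII ? and ! are included as an assumption).
SENTENCE_RE = re.compile(r"[^。？！?!]+[。？！?!]?")


def split_sentences(text: str, is_html: bool = False) -> list:
    """Split Chinese text into sentences, removing HTML tags first if asked."""
    if is_html:
        text = TAG_RE.sub("", text)
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


print(split_sentences("今天开会。你来吗？太好了！"))
print(split_sentences("<p>今天开会。</p><p>你来吗？</p>", is_html=True))
```

A production system would more likely use a real HTML parser for tag removal; the regular expression here only keeps the sketch self-contained.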
Step S2: extract, from the target substring set, a preset number of target substrings related to the sentence.
Specifically, the target substrings that need to be extracted are combined into a target substring set. For each sentence, the preset number (such as 5 or 10) of target substrings most relevant to that sentence must be found. The invention searches for the target substrings using full-text indexing, also known as inverted indexing. The principle of a full-text index is to define a lexicon first, then find the frequency and position of each term in the article, and organize the frequency and position information in lexicon order. This is equivalent to building, for the document, an index that uses the lexicon as its table of contents, so that the position of a term can be located quickly when it is searched for. Preferably, the full-text index is implemented with Lucene.
Preferably, extracting a preset number of target substrings related to the sentence from the target substring set includes the following steps.
21) Establish, on a per-Chinese-character basis, a full-text index of the target substrings in the target substring set against the Chinese text.
The full-text index of the target substrings against the Chinese text is constructed using full-text indexing technology. The full-text index is keyed by the position index of each sentence in the Chinese text.
22) Search for the target substrings that match the sentence based on the full-text index.
For each sentence, the matching target substrings are found based on the full-text index.
23) Select a preset number of the target substrings that match the sentence.
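Steps 21) to 23) can be approximated with a small character-level inverted index. This is a simplified stand-in, not Lucene and not necessarily the indexing layout the patent intends: the names `build_char_index` and `top_targets`, the ranking rule (fraction of a target's distinct characters found in the sentence) and the sample strings are all assumptions.

```python
from collections import Counter, defaultdict


def build_char_index(targets: list) -> dict:
    """Map each character to the ids of the target substrings containing it."""
    index = defaultdict(set)
    for target_id, target in enumerate(targets):
        for ch in set(target):
            index[ch].add(target_id)
    return index


def top_targets(sentence: str, targets: list, index: dict, k: int = 5) -> list:
    """Return up to k target substrings most related to the sentence."""
    hits = Counter()
    for ch in set(sentence):
        for target_id in index.get(ch, ()):
            hits[target_id] += 1
    # Hypothetical scoring rule: share of the target's distinct characters
    # that occur in the sentence; ties are broken by target id.
    ranked = sorted(hits, key=lambda t: (-hits[t] / len(set(targets[t])), t))
    return [targets[t] for t in ranked[:k]]


# Illustrative target set and sentence (assumed).
targets = ["医药卫生体制改革", "博物馆参观规定", "班级劳动排序表"]
index = build_char_index(targets)
print(top_targets("推进医药卫生体制改草。", targets, index, k=1))
```

Only targets sharing at least one character with the sentence are ever scored, which is what lets the later similarity step run on a short candidate list instead of the whole target set.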
Step S3: extract a plurality of split substrings from each sentence.
Specifically, extracting a plurality of split substrings from each sentence includes the following steps.
31) Perform word segmentation on the sentence to obtain a plurality of words.
Word segmentation refers to splitting a text into a sequence of words such that concatenating the words reproduces the original text and the word sequence is semantically reasonable and complete. Preferably, word segmentation is performed using a conditional random field (CRF) word segmentation algorithm, an N-shortest-path word segmentation algorithm, or the like.
32) Combine adjacent words to obtain the split substrings, wherein the length of each split substring lies within a preset proportional interval of the length of the target substring.
Adjacent words are combined to obtain a plurality of substrings, which serve as the split substrings. Preferably, the length of a split substring is limited to between 1/2 × length and length + 3, where length is the length of the target substring. For example, if the length of the target substring is 10, the minimum length of a split substring is 1/2 × length = 5 and the maximum is length + 3 = 13, so the split substring length ranges from 5 to 13.
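Step 32) can be sketched as an enumeration of runs of adjacent words filtered by the length interval. The sketch assumes word segmentation has already been done by some external tokenizer; the function name `split_substrings`, the `extra` parameter and the returned tuple layout are illustrative choices, not part of the patent.

```python
def split_substrings(words: list, target_len: int, extra: int = 3) -> list:
    """Return (start, end, text) for every run words[start:end] whose
    character length lies in [target_len / 2, target_len + extra]."""
    min_len = target_len / 2
    max_len = target_len + extra
    results = []
    for start in range(len(words)):
        text = ""
        for end in range(start, len(words)):
            text += words[end]
            if len(text) > max_len:
                break  # longer runs from this start can only grow
            if len(text) >= min_len:
                results.append((start, end + 1, text))
    return results


# Assumed segmentation of an illustrative sentence; target length 8
# gives the interval 4..11.
words = ["推进", "医药", "卫生", "体制", "改", "草"]
for start, end, text in split_substrings(words, target_len=8):
    print(start, end, text)
```

Keeping the word indices alongside the text lets a later verification step look at the tokens adjacent to a chosen substring.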
Step S4: for each target substring, calculate its similarity with each split substring, and when the maximum similarity is greater than a preset threshold, take the split substring with the maximum similarity as a fuzzy substring.
Specifically, for each target substring, every split substring is traversed and the similarity between the two is calculated. Calculating the similarity with each split substring includes the following steps.
41) Extract the characters, pinyin and Bi-Gram arrays of the target substring and the split substring.
The N-Gram model is a statistical language model used to represent sentences. It slides a window of size N over the text character by character, producing a sequence of character segments of length N. Each segment is called a gram. The occurrence frequencies of all grams are counted and filtered against a preset threshold to form a list of key grams, which is the vector feature space of the text; each gram in the list is one feature vector dimension. The N-Gram model is based on the assumption that the occurrence of the Nth word depends only on the preceding N-1 words and on no other word, so that the probability of a whole sentence is the product of the occurrence probabilities of its words. These probabilities can be obtained by directly counting, in a corpus, how often the N words occur together. Bi-Gram refers to the N-Gram model with N = 2.
42) Obtain the number of equal characters, the number of equal pinyin and the number of equal Bi-Gram array entries between the target substring and the split substring. For example, if the target substring is "digital debugging" (shu zi cha cuo) and the split substring is "number error" (shu zi cha cuo), the two four-character strings differ in one character but have identical pinyin, so the number of equal characters is 3 and the number of equal pinyin is 4.
By default, flat-tongue and retroflex initials (z/zh, c/ch, s/sh) are treated as equal when pinyin is compared.
43) Calculate the punctuation weight as punctuation weight = (target substring punctuation number - source substring punctuation number) / 4.0.
44) Calculate the character weight, pinyin weight and Bi-Gram array weight, wherein character weight = 2.0 × (number of equal characters + punctuation weight) / (target substring length + split substring length), pinyin weight = 2.0 × (number of equal pinyin + punctuation weight) / (target substring length + split substring length), and Bi-Gram array weight = 1.0 × (number of equal Bi-Gram array entries + punctuation weight) / (target substring length + split substring length).
45) Calculate the similarity as similarity = character weight × pinyin weight × Bi-Gram array weight.
Preferably, when the head and tail characters of the target substring and the split substring are the same, the two are considered more similar and the Bi-Gram array weight is increased. The Bi-Gram array weight is updated to Bi-Gram array weight = 0.4 + 0.6 × (number of equal Bi-Gram array entries + punctuation weight × 2) / (target substring length + split substring length).
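Steps 41) to 45) and the head/tail adjustment can be put together in a short sketch. Several points are assumptions because the text does not fix them: "equal number" is taken as the size of the multiset intersection, the "source substring" in the punctuation weight is taken to be the split substring, "head and tail the same" is taken to mean equal first and equal last characters, the punctuation set is illustrative, and pinyin is passed in as syllable lists because converting characters to pinyin needs an external tool. The sample strings are an assumed rendering of the four-character example.

```python
from collections import Counter

PUNCTUATION = set("，。、；：？！,.;:?!")  # assumed punctuation set


def bigrams(s: str) -> list:
    """Bi-Gram array: every pair of adjacent characters."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def normalize(syllable: str) -> str:
    """Treat flat-tongue and retroflex initials as equal (zh/ch/sh -> z/c/s)."""
    for retroflex, flat in (("zh", "z"), ("ch", "c"), ("sh", "s")):
        if syllable.startswith(retroflex):
            return flat + syllable[2:]
    return syllable


def equal_count(a, b) -> int:
    """Number of equal items, counted as a multiset intersection (assumption)."""
    return sum((Counter(a) & Counter(b)).values())


def similarity(target: str, split: str, target_py: list, split_py: list) -> float:
    total_len = len(target) + len(split)
    if total_len == 0:
        return 0.0
    punct = (sum(c in PUNCTUATION for c in target)
             - sum(c in PUNCTUATION for c in split)) / 4.0
    char_w = 2.0 * (equal_count(target, split) + punct) / total_len
    n_py = equal_count([normalize(p) for p in target_py],
                       [normalize(p) for p in split_py])
    pinyin_w = 2.0 * (n_py + punct) / total_len
    n_bi = equal_count(bigrams(target), bigrams(split))
    if target and split and target[0] == split[0] and target[-1] == split[-1]:
        bigram_w = 0.4 + 0.6 * (n_bi + punct * 2) / total_len  # head/tail match
    else:
        bigram_w = 1.0 * (n_bi + punct) / total_len
    return char_w * pinyin_w * bigram_w


py = ["shu", "zi", "cha", "cuo"]
print(similarity("数字查错", "数字差错", py, py))
```

For this pair the three factors are 0.75 (3 equal characters of 8), 1.0 (4 equal pinyin of 8) and 0.475 (1 equal Bi-Gram with matching head and tail), so the product is about 0.356.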
After the similarity calculation is complete, the maximum similarity is selected. If the maximum similarity is greater than the preset threshold, the split substring corresponding to it is taken as the fuzzy substring.
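The selection rule above amounts to an arg-max followed by a threshold test. In the sketch below, the function name `best_fuzzy_substring` is hypothetical and the scoring function is supplied by the caller; the demonstration uses `difflib.SequenceMatcher` purely as a stand-in score, not the weighted similarity defined in steps 41) to 45), and the threshold values are illustrative because the text does not state one.

```python
from difflib import SequenceMatcher


def best_fuzzy_substring(target, candidates, score, threshold):
    """Return (candidate, similarity) for the best-scoring candidate,
    or None when no candidate scores strictly above the threshold."""
    best, best_score = None, float("-inf")
    for candidate in candidates:
        s = score(target, candidate)
        if s > best_score:
            best, best_score = candidate, s
    if best is None or best_score <= threshold:
        return None
    return best, best_score


def stand_in_score(a: str, b: str) -> float:
    # Stand-in similarity (assumption), used only to make the demo runnable.
    return SequenceMatcher(None, a, b).ratio()


candidates = ["医药卫生体制改", "推进医药", "卫生体制"]
print(best_fuzzy_substring("医药卫生体制改革", candidates, stand_in_score, 0.5))
print(best_fuzzy_substring("医药卫生体制改革", candidates, stand_in_score, 0.95))
```

Passing the score as a parameter keeps the selection logic independent of how the similarity is actually computed.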
Step S5: verify the fuzzy substring to generate an effective fuzzy substring.
Specifically, after a fuzzy substring is obtained, it must be verified further to obtain a more accurate effective fuzzy substring.
In one embodiment, verifying the fuzzy substring includes the following steps.
51) When the fuzzy substring is one character shorter than the target substring and the segmented token adjacent to the fuzzy substring is a single character, that single character is appended to complete the fuzzy substring.
Handling single characters is a boundary-handling problem. For example, suppose the original sentence is "the medicine health system changes grass", the target substring is "the medicine health system reforms", and the fuzzy substring is "the medicine health system changes". The fuzzy substring then needs to be verified: the adjacent single character is appended, so that the verified fuzzy substring is "the medicine health system changes grass". Here "changes grass" is not a word; it is a wrong character of the kind produced by the Wubi (five-stroke) input method, so only the one adjacent single character is appended. If the original sentence were instead "the medicine health system changes the draft", where the adjacent "draft" is a word rather than a single character, nothing is appended. As another example, suppose the original sentence is "the visit plan of the museum", the target substring is "museum visit regulations", and the fuzzy substring is "the visit gauge of the museum"; the verified fuzzy substring is "the visit plan of the museum". Based on the index position of the fuzzy substring in the sentence, the segmented token at the end position of the fuzzy substring is found to be the word "plan", so the fuzzy substring, whose end had been cut off from that token, is completed with it.
52) When two target substrings correspond to the same fuzzy substring, the target substring with the higher similarity is selected to correspond to that fuzzy substring.
This handles the case where fuzzy substrings overlap. For example, suppose the two target substrings are "class labor order list" and "Guangming Middle School class labor order list". When "Guangming Middle School class labor order list" appears in the Chinese text, it can serve as a fuzzy substring of both target substrings at the same time. In that case, the target substring "Guangming Middle School class labor order list" is selected to correspond to the fuzzy substring "Guangming Middle School class labor order list".
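The two verification rules can be sketched as follows. The function names `complete_single_char` and `resolve_conflicts` are hypothetical, only the token following the fuzzy substring is examined (the text says "adjacent" without fixing a side), conflicts are keyed by the fuzzy text rather than by its position, and the Chinese strings and similarity values are illustrative assumptions.

```python
def complete_single_char(words: list, start: int, end: int, target: str):
    """Rule 51: if words[start:end] is one character shorter than the target
    and the next token is a single character, append that token."""
    fuzzy = "".join(words[start:end])
    one_short = len(fuzzy) == len(target) - 1
    if one_short and end < len(words) and len(words[end]) == 1:
        return start, end + 1, fuzzy + words[end]
    return start, end, fuzzy


def resolve_conflicts(matches) -> dict:
    """Rule 52: matches is an iterable of (target, fuzzy, similarity).
    For each fuzzy substring keep the target with the highest similarity."""
    best = {}
    for target, fuzzy, sim in matches:
        if fuzzy not in best or sim > best[fuzzy][1]:
            best[fuzzy] = (target, sim)
    return {fuzzy: target for fuzzy, (target, _) in best.items()}


# Rule 51: the trailing single character is appended ...
print(complete_single_char(["医药", "卫生", "体制", "改", "草"], 0, 4, "医药卫生体制改革"))
# ... but a trailing multi-character word is left alone.
print(complete_single_char(["医药", "卫生", "体制", "改", "草稿"], 0, 4, "医药卫生体制改革"))

# Rule 52: hypothetical similarity values for two competing targets.
matches = [
    ("班级劳动排序表", "光明中学班级劳动排序表", 0.5),
    ("光明中学班级劳动排序表", "光明中学班级劳动排序表", 0.9),
]
print(resolve_conflicts(matches))
```

In a full pipeline the conflict key would more plausibly be the substring's position in the sentence, so that two identical fuzzy substrings at different positions are not merged.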
The protection scope of the Chinese substring extraction method according to the embodiments of the present invention is not limited to the order of execution of the steps listed in the embodiments; all schemes obtained by adding, removing or replacing steps according to the prior art on the basis of the principles of the present invention fall within the protection scope of the present invention.
The embodiment of the invention also provides a Chinese substring extraction system that can implement the Chinese substring extraction method. The devices implementing the system include, but are not limited to, the structure of the Chinese substring extraction system listed in this embodiment; all structural variations and substitutions of the prior art made according to the principles of the invention fall within the protection scope of the invention.
As shown in fig. 3, in an embodiment, the Chinese substring extraction system of the present invention includes a splitting module 31, a first extraction module 32, a second extraction module 33, a calculation module 34 and a verification module 35.
The splitting module 31 is configured to split the chinese text into a plurality of sentences.
The first extraction module 32 is connected to the splitting module 31, and is configured to extract a preset number of target substrings related to the sentence from the target substring set.
The second extraction module 33 is connected to the splitting module 31, and is configured to extract a plurality of split substrings from each sentence.
The calculating module 34 is connected to the first extracting module 32 and the second extracting module 33, and is configured to calculate, for each target substring, a similarity with each split substring, and when the maximum similarity is greater than a preset threshold, take the split substring corresponding to the maximum similarity as a fuzzy substring.
The verification module 35 is connected to the calculation module 34, and is configured to verify the fuzzy substring to generate an effective fuzzy substring.
The structures and principles of the splitting module 31, the first extracting module 32, the second extracting module 33, the calculating module 34 and the checking module 35 are in one-to-one correspondence with the steps in the above-mentioned chinese substring extracting method, so that the description thereof will not be repeated here.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus, or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention. For example, functional modules/units in various embodiments of the invention may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill would further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The embodiment of the invention also provides a computer readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps of the method in the above embodiments may be implemented by a program instructing a processor. The program may be stored in a computer readable storage medium, which is a non-transitory medium such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof. The storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)).
The embodiment of the invention also provides electronic equipment. The electronic device includes a processor and a memory.
The memory is used for storing a computer program.
The memory includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, a USB flash drive, a memory card, or an optical disk.
The processor is connected with the memory and is used for executing the computer program stored in the memory so as to enable the electronic equipment to execute the Chinese substring extraction method.
Preferably, the processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
As shown in FIG. 4, the electronic device of the present invention is embodied in the form of a general-purpose computing device. Components of the electronic device may include, but are not limited to: one or more processors or processing units 41, a memory 42, and a bus 43 connecting the different system components (including the memory 42 and the processing unit 41).
Bus 43 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic devices typically include a variety of computer system readable media. Such media can be any available media that can be accessed by the electronic device and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 42 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 421 and/or cache memory 422. The electronic device may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, the storage system 423 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard disk drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be coupled to bus 43 through one or more data media interfaces. Memory 42 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 424 having a set (at least one) of program modules 4241 may be stored in, for example, memory 42, such program modules 4241 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 4241 generally perform the functions and/or methodologies of the described embodiments of the invention.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, display, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 44. The electronic device may also communicate with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet, via the network adapter 45. As shown in FIG. 4, the network adapter 45 communicates with other modules of the electronic device over the bus 43. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The above embodiments merely illustrate the principles of the present invention and its effects, and are not intended to limit the invention. Those skilled in the art may modify or vary the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims (9)

1. A method for extracting Chinese substrings, the method comprising the steps of:
splitting the Chinese text into a plurality of sentences;
extracting a preset number of target substrings related to the sentence from a target substring set;
extracting a plurality of split substrings from each sentence;
for each target substring, calculating the similarity with each split substring, and taking the split substring corresponding to the maximum similarity as a fuzzy substring when the maximum similarity is larger than a preset threshold;
checking the fuzzy substring to generate an effective fuzzy substring;
calculating the similarity with each split substring comprises the following steps:
extracting characters, pinyin and Bi-Gram arrays of the target substring and the split substring;
acquiring the equal number of characters, the equal number of pinyin and the equal number of Bi-Gram arrays of the target substring and the split substring;
calculating punctuation weight according to punctuation weight = (target substring punctuation number − source substring punctuation number) / 4.0;
calculating character weight, pinyin weight and Bi-Gram array weight, wherein the character weight = 2.0 × (equal number of characters + punctuation weight) / (target substring length + split substring length), the pinyin weight = 2.0 × (equal number of pinyin + punctuation weight) / (target substring length + split substring length), and the Bi-Gram array weight = 1.0 × (equal number of Bi-Gram arrays + punctuation weight) / (target substring length + split substring length);
and calculating the similarity according to the similarity = character weight × pinyin weight × Bi-Gram array weight.
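
The similarity computation of claim 1 can be sketched as follows. This is a minimal illustration under several assumptions that the claim does not spell out: the "equal number" of characters, pinyin and Bi-Grams is taken to be a multiset overlap, the "source substring" in the punctuation weight is taken to be the split substring, lengths are counted in characters, and `PUNCTUATION`, `TOY_PINYIN` and all function names are hypothetical. A real system would replace the toy pinyin table with a full pinyin dictionary.

```python
from collections import Counter

# Assumed punctuation set; the claim does not enumerate which marks count.
PUNCTUATION = set(",。!?;:、,.!?;:")

# Toy pinyin table for illustration only (hypothetical, far from complete).
TOY_PINYIN = {"蜜": "mi", "密": "mi", "度": "du", "信": "xin", "息": "xi"}


def toy_pinyin(ch):
    """Return the toy pinyin of a character, or the character itself if unknown."""
    return TOY_PINYIN.get(ch, ch)


def bigrams(s):
    """Return the Bi-Gram array of a string: all adjacent character pairs."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def equal_count(a, b):
    """Count items shared by two sequences (multiset overlap, an assumption)."""
    return sum((Counter(a) & Counter(b)).values())


def punctuation_count(s):
    return sum(ch in PUNCTUATION for ch in s)


def similarity(target, split, to_pinyin=toy_pinyin):
    """Similarity = character weight x pinyin weight x Bi-Gram array weight."""
    total_len = len(target) + len(split)
    if total_len == 0:
        return 0.0
    punct_w = (punctuation_count(target) - punctuation_count(split)) / 4.0
    char_w = 2.0 * (equal_count(target, split) + punct_w) / total_len
    pinyin_w = 2.0 * (
        equal_count([to_pinyin(c) for c in target], [to_pinyin(c) for c in split])
        + punct_w
    ) / total_len
    bigram_w = 1.0 * (equal_count(bigrams(target), bigrams(split)) + punct_w) / total_len
    return char_w * pinyin_w * bigram_w


# "密度信息" differs from the target "蜜度信息" by one homophone character:
# character weight 0.75, pinyin weight 1.0, Bi-Gram array weight 0.25.
print(similarity("蜜度信息", "密度信息"))  # 0.1875
```

Under these assumptions the pinyin weight stays at 1.0 for a homophone substitution, so the penalty comes only from the character and Bi-Gram terms.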
2. The Chinese substring extraction method of claim 1, wherein: splitting the Chinese text into a plurality of sentences comprises the steps of:
acquiring sentences based on periods in the Chinese text;
acquiring sentences based on the question marks in the Chinese text;
and acquiring sentences based on the exclamation mark in the Chinese text.
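
The sentence splitting of claim 2 can be illustrated with a short regular expression. This is a sketch only: the function name `split_sentences` is hypothetical, and treating the ASCII `?` and `!` as terminators alongside the full-width marks is an assumption.

```python
import re

# A sentence is a run of non-terminator characters followed by an optional
# terminator (full-width period, question mark or exclamation mark;
# ASCII "?" and "!" are included here as an assumption).
_SENTENCE = re.compile(r"[^。?!?!]+[。?!?!]?")


def split_sentences(text):
    """Split Chinese text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


print(split_sentences("今天下雨。你带伞了吗?太好了!"))
```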
3. The Chinese substring extraction method of claim 1, wherein: extracting a preset number of target substrings related to the sentence from the target substring set comprises the following steps:
establishing a full-text index, keyed by Chinese character, for the target substrings in the target substring set;
searching the target substring matched with the sentence based on the full text index;
and selecting a preset number of target substrings matched with the sentences.
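
One way to picture the retrieval step of claim 3 is a character-level inverted index over the target substring set. The sketch below is an assumption-laden stand-in for a real full-text search engine: the names `build_char_index` and `top_targets` are hypothetical, and ranking candidates by the fraction of their distinct characters that occur in the sentence is an illustrative scoring choice, not something the claim specifies.

```python
from collections import Counter, defaultdict


def build_char_index(targets):
    """Map each character to the positions of the target substrings containing it."""
    index = defaultdict(set)
    for pos, target in enumerate(targets):
        for ch in set(target):
            index[ch].add(pos)
    return index


def top_targets(sentence, targets, index, k):
    """Return up to k target substrings sharing the most characters with the sentence."""
    hits = Counter()
    for ch in set(sentence):
        for pos in index.get(ch, ()):
            hits[pos] += 1
    # Assumed score: share of the target's distinct characters found in the sentence.
    ranked = sorted(hits, key=lambda p: (-hits[p] / len(set(targets[p])), p))
    return [targets[p] for p in ranked[:k]]


targets = ["蜜度信息", "上海浦东", "信息技术"]
index = build_char_index(targets)
print(top_targets("上海蜜度信息技术有限公司成立了。", targets, index, k=2))
```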
4. The Chinese substring extraction method of claim 1, wherein: extracting a plurality of split substrings from each sentence comprises the following steps:
performing word segmentation on the sentence to obtain a plurality of words;
and combining adjacent words to obtain the split substrings, wherein the length of each split substring falls within a preset proportion interval of the length of the target substring.
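
The candidate generation of claim 4 can be sketched as below. The sketch assumes the sentence has already been segmented into words by some tokenizer (segmentation itself is not shown), and the function name `split_substrings` and the interval bounds `lo` and `hi` are hypothetical; the claim only says the proportion interval is preset.

```python
def split_substrings(words, target_len, lo=0.5, hi=1.5):
    """Join runs of adjacent words whose total length lies in
    [lo * target_len, hi * target_len] (bounds are illustrative)."""
    min_len, max_len = lo * target_len, hi * target_len
    candidates = []
    for start in range(len(words)):
        candidate = ""
        for word in words[start:]:
            candidate += word
            if len(candidate) > max_len:
                break  # adding more words only makes it longer
            if len(candidate) >= min_len:
                candidates.append(candidate)
    return candidates


words = ["上海", "密度", "信息", "成立"]
print(split_substrings(words, target_len=4, lo=0.75, hi=1.25))
```

Bounding the candidate length by the target length keeps the number of similarity computations per target small.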
5. The Chinese substring extraction method of claim 1, wherein: when the head and tail characters of the target substring and the split substring are the same, the Bi-Gram array weight is updated to Bi-Gram array weight = 0.4 + 0.6 × (equal number of Bi-Gram arrays + punctuation weight × 2) / (target substring length + split substring length).
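
The adjustment in claim 5 can be sketched as a branch on the first and last characters. The reading that "head and tail characters are the same" means the two substrings share both their first character and their last character is an assumption, as are the function names and the multiset-overlap counting of equal Bi-Grams.

```python
from collections import Counter


def bigrams(s):
    """Return the Bi-Gram array of a string: all adjacent character pairs."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def bigram_weight(target, split, punct_weight=0.0):
    """Bi-Gram array weight, boosted when both ends of the substrings agree."""
    total_len = len(target) + len(split)
    if total_len == 0:
        return 0.0
    equal = sum((Counter(bigrams(target)) & Counter(bigrams(split))).values())
    same_ends = (
        bool(target) and bool(split)
        and target[0] == split[0] and target[-1] == split[-1]
    )
    if same_ends:
        return 0.4 + 0.6 * (equal + punct_weight * 2) / total_len
    return 1.0 * (equal + punct_weight) / total_len


print(bigram_weight("蜜度信息", "蜜渡信息"))  # ends agree, boosted formula applies
print(bigram_weight("蜜度信息", "密度信息"))  # first characters differ, base formula applies
```

The constant 0.4 gives substrings with matching boundaries a floor on this weight even when few interior Bi-Grams match.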
6. The Chinese substring extraction method of claim 1, wherein: the checking of the fuzzy substring comprises the following steps:
when the fuzzy substring is one character shorter than the target substring and the segmented word adjacent to the fuzzy substring is a single character, appending that single character to the fuzzy substring to complete it;
and when two target substrings correspond to the same fuzzy substring, selecting the target substring with the higher similarity as the one corresponding to that fuzzy substring.
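
The two checks of claim 6 can be sketched as small helper functions. The names `pad_with_neighbor` and `resolve_conflicts` are hypothetical, and preferring the following word over the preceding word when both neighbours are single characters is an assumption; the claim does not say which side takes priority.

```python
def pad_with_neighbor(fuzzy, target, prev_word=None, next_word=None):
    """If fuzzy is one character shorter than target and a neighbouring
    segmented word is a single character, attach that character."""
    if len(target) - len(fuzzy) != 1:
        return fuzzy
    if next_word is not None and len(next_word) == 1:
        return fuzzy + next_word
    if prev_word is not None and len(prev_word) == 1:
        return prev_word + fuzzy
    return fuzzy


def resolve_conflicts(matches):
    """matches: iterable of (target, fuzzy, similarity) tuples.
    For each fuzzy substring keep only the target with the highest similarity."""
    best = {}
    for target, fuzzy, sim in matches:
        if fuzzy not in best or sim > best[fuzzy][2]:
            best[fuzzy] = (target, fuzzy, sim)
    return list(best.values())


print(pad_with_neighbor("密度信", "蜜度信息", prev_word="上海", next_word="息"))
print(resolve_conflicts([("蜜度信息", "密度信息", 0.19), ("信息技术", "密度信息", 0.05)]))
```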
7. A Chinese substring extraction system, characterized by comprising a splitting module, a first extraction module, a second extraction module, a calculation module and a verification module;
the splitting module is used for splitting the Chinese text into a plurality of sentences;
the first extraction module is used for extracting a preset number of target substrings related to the sentence in a target substring set;
the second extraction module is used for extracting a plurality of split substrings from each sentence;
the calculation module is used for calculating, for each target substring, the similarity with each split substring, and taking the split substring corresponding to the maximum similarity as a fuzzy substring when the maximum similarity is greater than a preset threshold;
the verification module is used for verifying the fuzzy substring to generate an effective fuzzy substring;
calculating the similarity with each split substring comprises the following steps:
extracting characters, pinyin and Bi-Gram arrays of the target substring and the split substring;
acquiring the equal number of characters, the equal number of pinyin and the equal number of Bi-Gram arrays of the target substring and the split substring;
calculating punctuation weight according to punctuation weight = (target substring punctuation number − source substring punctuation number) / 4.0;
calculating character weight, pinyin weight and Bi-Gram array weight, wherein the character weight = 2.0 × (equal number of characters + punctuation weight) / (target substring length + split substring length), the pinyin weight = 2.0 × (equal number of pinyin + punctuation weight) / (target substring length + split substring length), and the Bi-Gram array weight = 1.0 × (equal number of Bi-Gram arrays + punctuation weight) / (target substring length + split substring length);
and calculating the similarity according to the similarity = character weight × pinyin weight × Bi-Gram array weight.
8. An electronic device, the electronic device comprising: a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to execute the computer program stored in the memory, so that the electronic device executes the Chinese substring extraction method according to any one of claims 1 to 6.
9. A computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by an electronic device, implements the Chinese substring extraction method of any one of claims 1 to 6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310301303.6A CN116029284B (en) 2023-03-27 2023-03-27 Chinese substring extraction method, chinese substring extraction system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310301303.6A CN116029284B (en) 2023-03-27 2023-03-27 Chinese substring extraction method, chinese substring extraction system, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN116029284A (en) 2023-04-28
CN116029284B (en) 2023-07-21

Family

ID=86076211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310301303.6A Active CN116029284B (en) 2023-03-27 2023-03-27 Chinese substring extraction method, chinese substring extraction system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116029284B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101369278A (en) * 2008-09-27 2009-02-18 成都市华为赛门铁克科技有限公司 Approximate adaptation method and apparatus
CN115270768A (en) * 2022-04-19 2022-11-01 上海蜜度信息技术有限公司 Method and equipment for determining target key words to be corrected in text

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329680B (en) * 2008-07-17 2010-12-08 安徽科大讯飞信息科技股份有限公司 Large scale rapid matching method of sentence surface
CN111324784B (en) * 2015-03-09 2023-05-16 创新先进技术有限公司 Character string processing method and device
CN104750846B (en) * 2015-04-10 2017-12-08 浪潮集团有限公司 A kind of substring lookup method and device
JP6583686B2 (en) * 2015-06-17 2019-10-02 パナソニックIpマネジメント株式会社 Semantic information generation method, semantic information generation device, and program
CN109710833B (en) * 2018-12-29 2021-07-16 上海蜜度信息技术有限公司 Method and apparatus for determining content node
CN111930792B (en) * 2020-06-23 2024-04-12 北京大米科技有限公司 Labeling method and device for data resources, storage medium and electronic equipment
CN114781008B (en) * 2022-04-15 2022-10-28 山东省计算中心(国家超级计算济南中心) Data identification method and device for security detection of terminal firmware of Internet of things
CN115544999A (en) * 2022-10-20 2022-12-30 南京大学 Domain-oriented parallel large-scale text duplicate checking method

Also Published As

Publication number Publication date
CN116029284A (en) 2023-04-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 301ab, No.10, Lane 198, zhangheng Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai 201204

Patentee after: Shanghai Mido Technology Co.,Ltd.

Address before: Room 301ab, No. 10, Lane 198, zhangheng Road, Pudong New Area pilot Free Trade Zone, Shanghai, China, 201204

Patentee before: SHANGHAI MDATA INFORMATION TECHNOLOGY Co.,Ltd.