CN112527957A - Short text matching method and system applied to news field - Google Patents


Publication number
CN112527957A
Authority
CN
China
Prior art keywords
prefix
news
words
matched
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011424390.7A
Other languages
Chinese (zh)
Inventor
张友豪
冯卫强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Financial China Information & Technology Co ltd
Original Assignee
Shanghai Financial China Information & Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Financial China Information & Technology Co ltd
Priority to CN202011424390.7A
Publication of CN112527957A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a short text matching method and system applied to the news field. The method comprises: step M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method; step M2: storing the organization index and the news to be matched in a preset format; step M3: matching organizations in the news according to the news to be matched and the organization index. The invention can quickly match the relevant organizations in massive news data, addresses the low efficiency of matching news data, improves query efficiency, and saves storage space.

Description

Short text matching method and system applied to news field
Technical Field
The invention relates to the technical field of data processing and news retrieval, in particular to a short text matching method and system applied to the news field, and more particularly to a method and system for string processing and high-concurrency matching of organizations in news.
Background
With the development of the internet and the continuous advance of science and technology, data has entered an era of explosive growth, and news of all kinds emerges endlessly. Quickly identifying the organizations mentioned in massive volumes of news has become an important technology in the field of news data processing.
Organization matching in news currently faces two main challenges. The first is matching time complexity: with the arrival of the big data era, the volume of news data has grown rapidly, the number of matching features keeps increasing, and the matching process becomes ever more complicated. The second is efficiency: as the internet develops, the timeliness requirements on data become higher and higher, which places high demands on the processing capacity of the organization matching system.
To address these difficulties, the system uses a K-character prefix tree method to build an index over tens of millions of organizations and uses a Redis cluster for distributed index storage, which greatly reduces space complexity and offers a trade-off between the suffix tree and the suffix array in terms of memory use and search speed. A KMP algorithm is also adopted to improve matching performance.
Patent document CN110321562A (application number: 201910576788.3) discloses a BERT-based short text matching method. The method obtains first supervised task data for a first scene according to the requirements of that scene, performs noise reduction on the data to generate first data, extracts a first keyword from the first data, and converts the first data and the first keyword into a first original expression and a first feature expression. Both expressions are input to a preset short text matching model, which generates a first score for the first original expression and a second score for the first feature expression. Finally, the method determines whether the first score and/or the second score reaches a preset threshold: if so, the first supervised task data is determined to be a positive sample; otherwise it is determined to be a negative sample. The method is said to make maximal use of prior knowledge when supervised task data is limited, and to offer stronger robustness and interpretability.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a short text matching method and system applied to the news field.
The short text matching method applied to the news field provided by the invention comprises:
step M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method;
step M2: storing the organization index and the news to be matched in a preset format;
step M3: matching organizations in the news according to the news to be matched and the organization index.
Preferably, step M1 comprises:
step M1.1: for an organization name of N characters, selecting the first K characters as the organization-name prefix and the remaining N-K characters as the organization-name suffix;
step M1.2: constructing a prefix tree with each K-character prefix as the key and the suffixes of all organization names sharing that prefix as the value list.
Preferably, step M1.2 comprises: when the value list of suffixes sharing one prefix exceeds a preset size, extending the prefix length so that the value list of every key stays within a preset range.
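Steps M1.1 and M1.2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `build_prefix_index`, the parameter `max_bucket` (standing in for the "preset size"), and the rule of extending an oversized prefix by one character at a time are all assumptions, since the text does not specify how the prefix is lengthened.

```python
from collections import defaultdict


def build_prefix_index(names, k=3, max_bucket=2):
    """Group organization names by their first k characters (hypothetical helper).

    Buckets larger than max_bucket are split by extending the prefix
    one character at a time (an assumed expansion rule).
    """
    index = defaultdict(list)
    for name in names:
        if len(name) >= k:
            index[name[:k]].append(name[k:])  # key: prefix, value: suffix

    result = {}
    pending = list(index.items())
    while pending:
        prefix, suffixes = pending.pop()
        # Keep the bucket if it is small enough, or if some name equals
        # the prefix itself and therefore cannot be extended further.
        if len(suffixes) <= max_bucket or "" in suffixes:
            result[prefix] = suffixes
            continue
        split = defaultdict(list)
        for suffix in suffixes:
            split[prefix + suffix[0]].append(suffix[1:])
        pending.extend(split.items())
    return result


names = ["上海银行", "上海证券交易所", "上海电气集团", "北京银行"]
index = build_prefix_index(names, k=2, max_bucket=2)
# The common prefix "上海" is extended to three characters; "北京" is kept.
```

One consequence of this design is that prefixes no longer share a single length, so a matcher working on such an index has to probe windows of every prefix length that occurs in it.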
Preferably, step M2 comprises:
step M2.1: converting the K-character prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix dictionary;
step M2.2: encoding and storing the organizations in the value lists of the organization index.
Preferably, step M3 comprises:
step M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain preprocessed news to be matched;
step M3.2: splitting the preprocessed news into sentences and then into words according to preset rules;
step M3.3: performing organization prefix matching and organization full-name matching according to the organization index;
step M3.4: filtering the matched organizations and outputting the result.
Preferably, step M3.3 comprises:
step M3.3.1: loading the prefix file to obtain the prefix dictionary;
step M3.3.2: iterating over the sentence set of the news to be matched and comparing the K-character substrings of each sentence against the prefix dictionary; when a substring exists in the prefix dictionary, performing organization full-name matching between the sentence containing that prefix and the value list corresponding to the prefix; when the substring does not exist in the prefix dictionary, repeating step M3.3.2; and when no organization in the value list matches the sentence containing the prefix, repeating step M3.3.2 until all of the news to be matched has been processed.
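The loop of step M3.3.2, followed by the filtering of step M3.4, can be sketched as below. The sketch assumes a fixed prefix length `k`, an in-memory dictionary mapping each prefix to its suffix list, and a hypothetical `stop_orgs` set for the filter. It also simplifies the full-name check to testing whether a suffix directly follows the prefix at the current position, which is an assumed shortcut and not the sentence-level matching the text describes.

```python
def match_organizations(sentences, index, k, stop_orgs=frozenset()):
    """Scan every k-character window of every sentence (hypothetical helper)."""
    results = []
    for sentence in sentences:
        for i in range(len(sentence) - k + 1):
            window = sentence[i:i + k]
            suffixes = index.get(window)
            if suffixes is None:
                continue  # not a known prefix: move on to the next window
            for suffix in suffixes:
                # Full-name check: the suffix must directly follow the prefix.
                if sentence.startswith(suffix, i + k):
                    name = window + suffix
                    if name not in stop_orgs and name not in results:
                        results.append(name)
    return results


index = {"上海证": ["券交易所"], "北京银": ["行"]}
sentences = ["上海证券交易所今日发布公告", "北京银行与上海证券交易所签署协议"]
print(match_organizations(sentences, index, k=3))
```

Because most windows are not in the prefix dictionary, the expensive full-name comparison runs only for the few positions where a known prefix actually occurs.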
The invention also provides a short text matching system applied to the news field, comprising:
module M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method;
module M2: storing the organization index and the news to be matched in a preset format;
module M3: matching organizations in the news according to the news to be matched and the organization index.
Preferably, module M1 comprises:
module M1.1: for an organization name of N characters, selecting the first K characters as the organization-name prefix and the remaining N-K characters as the organization-name suffix;
module M1.2: constructing a prefix tree with each K-character prefix as the key and the suffixes of all organization names sharing that prefix as the value list;
module M1.2 comprises: when the value list of suffixes sharing one prefix exceeds a preset size, extending the prefix length so that the value list of every key stays within a preset range.
Preferably, module M2 comprises:
module M2.1: converting the K-character prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix dictionary;
module M2.2: encoding and storing the organizations in the value lists of the organization index.
Preferably, module M3 comprises:
module M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain preprocessed news to be matched;
module M3.2: splitting the preprocessed news into sentences and then into words according to preset rules;
module M3.3: performing organization prefix matching and organization full-name matching according to the organization index;
module M3.4: filtering the matched organizations and outputting the result;
module M3.3 comprises:
module M3.3.1: loading the prefix file to obtain the prefix dictionary;
module M3.3.2: iterating over the sentence set of the news to be matched and comparing the K-character substrings of each sentence against the prefix dictionary; when a substring exists in the prefix dictionary, performing organization full-name matching between the sentence containing that prefix and the value list corresponding to the prefix; when the substring does not exist in the prefix dictionary, module M3.3.2 is triggered again; when the sentence containing the prefix matches an organization in the value list, the matching result is added to the result list; and when it matches none, module M3.3.2 is triggered again, until all of the news to be matched has been processed.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention provides a method for constructing and storing a text index in a distributed manner, which improves the query efficiency;
2. the invention provides a method and a system for matching character strings, which aim to solve the technical problem of low data matching efficiency under the condition of mass data;
3. the invention can quickly match the relevant organizations in massive news data, addresses the low efficiency of matching news data, improves query efficiency, and saves storage space.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of prefix tree construction;
FIG. 2 is a comparison of efficiency for different prefix lengths;
FIG. 3 is a flow chart of matching organizations in news.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, and all such changes and modifications fall within the scope of the present invention.
Example 1
As shown in FIGS. 1-3, the short text matching method applied to the news field provided by the invention comprises:
step M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method;
step M2: storing the organization index and the news to be matched in a preset format;
step M3: matching organizations in the news according to the news to be matched and the organization index.
Specifically, step M1 comprises:
step M1.1: for an organization name of N characters, selecting the first K characters as the organization-name prefix and the remaining N-K characters as the organization-name suffix;
step M1.2: constructing a prefix tree with each K-character prefix as the key and the suffixes of all organization names sharing that prefix as the value list.
Specifically, step M1.2 comprises: when the value list of suffixes sharing one prefix exceeds a preset size, extending the prefix length so that the value list of every key stays within a preset range.
Specifically, step M2 comprises:
step M2.1: converting the K-character prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix dictionary;
step M2.2: encoding and storing the organizations in the value lists of the organization index.
Specifically, step M3 comprises:
step M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain preprocessed news to be matched;
step M3.2: splitting the preprocessed news into sentences and then into words according to preset rules;
step M3.3: performing organization prefix matching and organization full-name matching according to the organization index;
step M3.4: filtering the matched organizations and outputting the result.
Specifically, step M3.3 comprises:
step M3.3.1: loading the prefix file to obtain the prefix dictionary;
step M3.3.2: iterating over the sentence set of the news to be matched and comparing the K-character substrings of each sentence against the prefix dictionary; when a substring exists in the prefix dictionary, performing organization full-name matching between the sentence containing that prefix and the value list corresponding to the prefix; when the substring does not exist in the prefix dictionary, repeating step M3.3.2; and when no organization in the value list matches the sentence containing the prefix, repeating step M3.3.2 until all of the news to be matched has been processed.
The invention also provides a short text matching system applied to the news field, comprising:
module M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method;
module M2: storing the organization index and the news to be matched in a preset format;
module M3: matching organizations in the news according to the news to be matched and the organization index.
Specifically, module M1 comprises:
module M1.1: for an organization name of N characters, selecting the first K characters as the organization-name prefix and the remaining N-K characters as the organization-name suffix;
module M1.2: constructing a prefix tree with each K-character prefix as the key and the suffixes of all organization names sharing that prefix as the value list;
module M1.2 comprises: when the value list of suffixes sharing one prefix exceeds a preset size, extending the prefix length so that the value list of every key stays within a preset range.
Specifically, module M2 comprises:
module M2.1: converting the K-character prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix dictionary;
module M2.2: encoding and storing the organizations in the value lists of the organization index.
Specifically, module M3 comprises:
module M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain preprocessed news to be matched;
module M3.2: splitting the preprocessed news into sentences and then into words according to preset rules;
module M3.3: performing organization prefix matching and organization full-name matching according to the organization index;
module M3.4: filtering the matched organizations and outputting the result;
module M3.3 comprises:
module M3.3.1: loading the prefix file to obtain the prefix dictionary;
module M3.3.2: iterating over the sentence set of the news to be matched and comparing the K-character substrings of each sentence against the prefix dictionary; when a substring exists in the prefix dictionary, performing organization full-name matching between the sentence containing that prefix and the value list corresponding to the prefix; when the substring does not exist in the prefix dictionary, module M3.3.2 is triggered again; when the sentence containing the prefix matches an organization in the value list, the matching result is added to the result list; and when it matches none, module M3.3.2 is triggered again, until all of the news to be matched has been processed.
Example 2
Example 2 is a modification of Example 1.
1. Organization index building module
Step 1: selecting the first K characters of an organization name as its prefix, and the remaining N-K characters as its suffix;
Step 2: constructing a prefix tree with each K-character prefix as the Key and the suffixes of the organization names sharing that prefix as the Value;
The resulting structure, taking K = 3 as an example, is shown in FIG. 1.
A comparison of efficiency for different prefix lengths is shown in FIG. 2.
Step 3: for very common prefixes, that is, prefixes whose suffix Value list is too large, such as Shanghai and Beijing, extending the prefix length so that the Value list of every Key stays within a user-defined range.
2. Data storage module
Step 1: for the constructed organization prefix tree, converting the K-character prefixes into HashCodes through a hash algorithm;
Step 2: building a code mapping for the organizations in the Value lists, so that string values are converted to numeric values, which reduces storage space and speeds up queries;
Step 3: storing the prefixes as a file on the hard disk, and storing the converted organization index in a Redis cluster;
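The two conversions in the data storage module can be sketched as follows. The text does not name the hash algorithm or the coding scheme, so the choices here are assumptions: `prefix_hash` takes the first 8 bytes of an MD5 digest, organizations receive sequential integer IDs, and a plain dictionary stands in for the Redis cluster. Hash collisions are ignored in this sketch.

```python
import hashlib


def prefix_hash(prefix):
    """Stable 64-bit hash code for a prefix (MD5 is an assumed choice)."""
    digest = hashlib.md5(prefix.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def encode_index(index):
    """Return (hash code -> list of organization IDs, ID -> full name)."""
    org_ids = {}
    hashed = {}
    for prefix, suffixes in index.items():
        ids = []
        for suffix in suffixes:
            name = prefix + suffix
            # Assign the next sequential ID the first time a name is seen.
            ids.append(org_ids.setdefault(name, len(org_ids) + 1))
        hashed[prefix_hash(prefix)] = ids
    id_to_name = {org_id: name for name, org_id in org_ids.items()}
    return hashed, id_to_name


index = {"上海证": ["券交易所"], "北京银": ["行"]}
hashed, id_to_name = encode_index(index)
# hashed maps a numeric prefix code to numeric organization IDs only;
# id_to_name is needed to turn matched IDs back into readable names.
```

Storing integers instead of strings is what yields the space saving the step describes; the ID table is the only place the full names are kept.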
3. Organization matching module for news
3.1 Input module
This module acquires the news to be matched. It supports various input modes, such as pasting news text, reading from a database, receiving from a message queue, and reading from a file path;
3.2 News preprocessing module
This module standardizes the news acquired from the input module.
Step 1: if the news is in a file format such as PDF, Word, or HTML, converting the file first to obtain its text content; if the news is already plain text, proceeding to step 2;
Step 2: processing the punctuation marks uniformly by converting them into a unified identifier, and removing characters that are not Chinese characters, English letters, or Arabic numerals;
Step 3: outputting the formatted news text.
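Step 2 of the preprocessing module can be sketched with two regular expressions. The separator `|`, the set of punctuation marks treated as sentence boundaries, and the Unicode range `\u4e00-\u9fff` for Chinese characters are assumptions made for illustration; the text only says that punctuation becomes a unified identifier and other characters are removed.

```python
import re

SEP = "|"  # assumed unified identifier for punctuation
_PUNCT = re.compile(r"[。！？；，、：,.!?;:|\r\n]+")
_OTHER = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9|]+")
_REPEATED_SEP = re.compile(r"\|+")


def normalize_news(text):
    """Unify punctuation, then drop everything except CJK, letters, digits."""
    text = _PUNCT.sub(SEP, text)         # punctuation -> separator
    text = _OTHER.sub("", text)          # remove all other characters
    text = _REPEATED_SEP.sub(SEP, text)  # collapse separators left adjacent
    return text.strip(SEP)


print(normalize_news("上海银行（601229）发布公告。股价上涨！"))
```

Read literally, the rule also removes spaces and splits decimals such as 3.5 at the point, so a production version would likely need extra handling for English names and numbers.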
3.3 Text splitting module
Step 1: splitting the text into a set of news sentences at the punctuation marks;
Step 2: according to the organization prefix length K, splitting each sentence into K-character substrings, which are passed to the organization matching module.
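The two splitting steps can be sketched as below, assuming the preprocessed text uses `|` as its unified punctuation identifier and that the K-character substrings are taken as overlapping sliding windows (the text does not say whether the windows overlap). Both function names are hypothetical.

```python
def split_sentences(text, sep="|"):
    """Split normalized text into non-empty sentences at the separator."""
    return [sentence for sentence in text.split(sep) if sentence]


def k_grams(sentence, k):
    """All overlapping k-character windows of a sentence, left to right."""
    return [sentence[i:i + k] for i in range(len(sentence) - k + 1)]


sentences = split_sentences("上海银行发布公告|股价上涨")
windows = k_grams("上海银行", 3)
# A sentence shorter than k yields no windows at all.
```

Overlapping windows are used because an organization name can start at any character position in a sentence.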
3.4 Organization matching module (as shown in FIG. 3)
Step 1: loading the prefix file to obtain the prefix dictionary;
Step 2: iterating over the sentence set and comparing the K-character substrings of each sentence against the prefix dictionary; if a substring exists in the prefix dictionary, proceeding to step 3; otherwise, continuing with step 2;
Step 3: performing organization full-name matching between the sentence Sen1 containing the prefix and the Value1 list corresponding to the prefix Key1, using the KMP algorithm to speed up matching; if an organization [Org1, Org2, ...] in the Value1 list is matched in Sen1, adding the matching result to the result list; if none is matched, returning to step 2;
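The full-name matching of step 3 can be sketched with a standard Knuth-Morris-Pratt search. The helper `match_full_names` and the convention that the Value list holds suffixes, which are rejoined with the prefix before searching, are assumptions made for illustration.

```python
def build_failure(pattern):
    """KMP failure table: length of the longest proper border of each prefix."""
    fail = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j > 0 and pattern[i] != pattern[j]:
            j = fail[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        fail[i] = j
    return fail


def kmp_find(text, pattern):
    """Index of the first occurrence of pattern in text, or -1 if absent."""
    if not pattern:
        return 0
    fail = build_failure(pattern)
    j = 0
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = fail[j - 1]  # fall back without re-reading the text
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


def match_full_names(sentence, prefix, suffixes):
    """Organizations from one Value list whose full name occurs in the sentence."""
    return [prefix + s for s in suffixes if kmp_find(sentence, prefix + s) != -1]


print(match_full_names("上海证券交易所今日发布公告", "上海证", ["券交易所", "券报"]))
```

KMP never moves backwards in the sentence, so each candidate name is checked in time linear in the sentence length plus the name length.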
3.5 Output module
This module loads the stop words and stop organizations, filters the result list produced by the organization matching module, and outputs the final organization matching result.
Those skilled in the art will appreciate that, in addition to implementing the system, apparatus, and modules provided by the present invention purely as computer-readable program code, the same functions can be achieved by logically programming the method steps so that the system, apparatus, and modules take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, the system, apparatus, and modules provided by the present invention can be regarded as a hardware component, and the modules included therein for implementing various programs can be regarded as structures within that hardware component; modules for performing various functions can also be regarded both as software programs implementing the method and as structures within the hardware component.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A short text matching method applied to the news field, characterized by comprising:
step M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method;
step M2: storing the organization index and the news to be matched in a preset format;
step M3: matching organizations in the news according to the news to be matched and the organization index.
2. The short text matching method applied to the news field according to claim 1, wherein step M1 comprises:
step M1.1: for an organization name of N characters, selecting the first K characters as the organization-name prefix and the remaining N-K characters as the organization-name suffix;
step M1.2: constructing a prefix tree with each K-character prefix as the key and the suffixes of all organization names sharing that prefix as the value list.
3. The short text matching method applied to the news field according to claim 2, wherein step M1.2 comprises: when the value list of suffixes sharing one prefix exceeds a preset size, extending the prefix length so that the value list of every key stays within a preset range.
4. The short text matching method applied to the news field according to claim 1, wherein step M2 comprises:
step M2.1: converting the K-character prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix dictionary;
step M2.2: encoding and storing the organizations in the value lists of the organization index.
5. The short text matching method applied to the news field according to claim 1, wherein step M3 comprises:
step M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain preprocessed news to be matched;
step M3.2: splitting the preprocessed news into sentences and then into words according to preset rules;
step M3.3: performing organization prefix matching and organization full-name matching according to the organization index;
step M3.4: filtering the matched organizations and outputting the result.
6. The short text matching method applied to the news field according to claim 5, wherein step M3.3 comprises:
step M3.3.1: loading the prefix file to obtain the prefix dictionary;
step M3.3.2: iterating over the sentence set of the news to be matched and comparing the K-character substrings of each sentence against the prefix dictionary; when a substring exists in the prefix dictionary, performing organization full-name matching between the sentence containing that prefix and the value list corresponding to the prefix; when the substring does not exist in the prefix dictionary, repeating step M3.3.2; and when no organization in the value list matches the sentence containing the prefix, repeating step M3.3.2 until all of the news to be matched has been processed.
7. A short text matching system applied to the news field, characterized by comprising:
module M1: constructing an organization index for the organization names to be matched by using a K-character prefix tree method;
module M2: storing the organization index and the news to be matched in a preset format;
module M3: matching organizations in the news according to the news to be matched and the organization index.
8. The short text matching system applied to the news field according to claim 7, wherein module M1 comprises:
module M1.1: for an organization name of N characters, selecting the first K characters as the organization-name prefix and the remaining N-K characters as the organization-name suffix;
module M1.2: constructing a prefix tree with each K-character prefix as the key and the suffixes of all organization names sharing that prefix as the value list;
module M1.2 comprises: when the value list of suffixes sharing one prefix exceeds a preset size, extending the prefix length so that the value list of every key stays within a preset range.
9. The short text matching system applied to the news field according to claim 7, wherein module M2 comprises:
module M2.1: converting the K-character prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix dictionary;
module M2.2: encoding and storing the organizations in the value lists of the organization index.
10. The short text matching system applied to the news field according to claim 1, wherein module M3 comprises:
module M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain preprocessed news to be matched;
module M3.2: splitting the preprocessed news into sentences and then into words according to preset rules;
module M3.3: performing organization prefix matching and organization full-name matching according to the organization index;
module M3.4: filtering the matched organizations and outputting the result;
module M3.3 comprises:
module M3.3.1: loading the prefix file to obtain the prefix dictionary;
module M3.3.2: iterating over the sentence set of the news to be matched and comparing the K-character substrings of each sentence against the prefix dictionary; when a substring exists in the prefix dictionary, performing organization full-name matching between the sentence containing that prefix and the value list corresponding to the prefix; when the substring does not exist in the prefix dictionary, module M3.3.2 is triggered again; when the sentence containing the prefix matches an organization in the value list, the matching result is added to the result list; and when it matches none, module M3.3.2 is triggered again, until all of the news to be matched has been processed.
CN202011424390.7A 2020-12-08 2020-12-08 Short text matching method and system applied to news field Pending CN112527957A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011424390.7A CN112527957A (en) 2020-12-08 2020-12-08 Short text matching method and system applied to news field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011424390.7A CN112527957A (en) 2020-12-08 2020-12-08 Short text matching method and system applied to news field

Publications (1)

Publication Number Publication Date
CN112527957A true CN112527957A (en) 2021-03-19

Family

ID=74998241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011424390.7A Pending CN112527957A (en) 2020-12-08 2020-12-08 Short text matching method and system applied to news field

Country Status (1)

Country Link
CN (1) CN112527957A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438145A (en) * 2022-04-13 2022-12-06 盐城金堤科技有限公司 Method and device for adding enterprise detail internal chain
CN115438145B (en) * 2022-04-13 2024-05-14 盐城天眼察微科技有限公司 Method and device for adding enterprise detail inner links

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination