CN112527957A

CN112527957A - Short text matching method and system applied to news field

Info

Publication number: CN112527957A
Application number: CN202011424390.7A
Authority: CN
Inventors: 张友豪; 冯卫强
Original assignee: Shanghai Financial China Information & Technology Co ltd
Current assignee: Shanghai Financial China Information & Technology Co ltd
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2021-03-19

Abstract

The invention provides a short text matching method and system applied to the news field. The method comprises: step M1, constructing an organization index for the organization names to be matched by using a K-word prefix tree method; step M2, storing the organization index and the news to be matched in a preset format; and step M3, matching organizations in the news according to the news to be matched and the organization index. The invention can quickly match related organizations in massive news data, addresses the low efficiency of matching news data, improves query efficiency, and saves storage space.

Description

Short text matching method and system applied to news field

Technical Field

The invention relates to the technical field of data processing and news retrieval, in particular to a short text matching method and system applied to the news field, and more particularly to a method and system for string processing and high-concurrency matching of organizations in news.

Background

With the development of the internet and the continuous advance of science and technology, data volumes have exploded, and news items in particular appear in an endless stream. Quickly identifying the organizations mentioned in massive volumes of news has become an important technique in the field of news data processing.

Current news organization matching technology faces two main challenges. The first is the time complexity of matching: with the arrival of the big data era, the volume of news data has grown rapidly, the number of features to match keeps increasing, and the matching process becomes ever more complex. The second is efficiency: as the internet develops, the timeliness requirements on data keep rising, which places high demands on the processing capacity of an organization matching system.

To address these difficulties, the system uses a K-word prefix tree method to build an index over tens of millions of organizations and uses a Redis cluster for distributed index storage. This greatly reduces space complexity and offers a compromise between the suffix tree and the suffix array in terms of memory use and search speed. The KMP algorithm is also used to improve matching performance.

Patent document CN110321562A (application number: 201910576788.3) discloses a BERT-based short text matching method. The method obtains first supervised task data of a first scene according to the requirements of that scene, performs noise reduction on the first supervised task data to generate first data, and extracts a first keyword from the first data. It then converts the first data and the first keyword into a first original expression and a first feature expression, inputs each of them into a preset short text matching model, and generates a first score for the first original expression and a second score for the first feature expression. Finally, it determines whether the first score and/or the second score reaches a preset threshold: if so, the first supervised task data is determined to be a positive sample; otherwise it is determined to be a negative sample. The method makes maximal use of prior knowledge when supervised task data is limited, and has stronger robustness and interpretability.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a short text matching method and a short text matching system applied to the news field.

The short text matching method applied to the news field provided by the invention comprises the following steps:

step M1: constructing an organization index for the organization names to be matched by using a K-word prefix tree method;

step M2: storing the organization index and the news to be matched in a preset format;

step M3: matching organizations in the news according to the news to be matched and the organization index.

Preferably, the step M1 includes:

step M1.1: an organization name comprises N characters; the first K characters of the organization name are selected as the organization prefix, and the remaining N-K characters are used as the organization suffix;

step M1.2: constructing a prefix tree by taking the K-character prefixes as keys and the suffixes of the organization names sharing the same prefix as values.

Preferably, said step M1.2 comprises: when the size of the value list of suffixes sharing the same prefix exceeds a preset value, expanding the prefix length so that the value list size of each key stays within a preset range.
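As an illustration of steps M1.1 and M1.2, the following is a minimal Python sketch, not the patented implementation. It uses a flat dictionary to stand in for the prefix tree, and the function name `build_prefix_index`, the parameter `max_bucket` (the "preset value") and the one-character-at-a-time expansion policy are all assumptions made for the example.

```python
from collections import defaultdict

def build_prefix_index(names, k=3, max_bucket=2):
    """Map each prefix to the suffixes of the organization names sharing it.

    Buckets larger than max_bucket are rebuilt with a prefix one character
    longer (the "prefix length expansion" of step M1.2, as assumed here).
    """
    index = {}
    pending = [(k, sorted(set(names)))]  # (prefix length, names to index)
    while pending:
        length, group = pending.pop()
        buckets = defaultdict(list)
        for name in group:
            buckets[name[:length]].append(name)
        for prefix, members in buckets.items():
            longer = [m for m in members if len(m) > length]
            if len(members) > max_bucket and longer:
                # Bucket too large: retry its longer names with a longer prefix.
                pending.append((length + 1, longer))
                members = [m for m in members if len(m) <= length]
            if members:
                index.setdefault(prefix, []).extend(m[len(prefix):] for m in members)
    return index

# Hypothetical sample data: "上海" is too common for K = 2, so it is expanded.
orgs = ["上海银行", "上海银河证券", "上海电气", "上海医药", "北京银行"]
index = build_prefix_index(orgs, k=2, max_bucket=2)
```

With this sample, the key "上海" disappears from the index and is replaced by the three-character keys "上海银", "上海电" and "上海医", while "北京" keeps its two-character key. One consequence of this design is that keys no longer share a single length, which the matching side has to account for.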

Preferably, the step M2 includes:

step M2.1: converting the K-word prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix word dictionary;

step M2.2: encoding and storing the organizations in the value lists of the organization index.

Preferably, the step M3 includes:

step M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain the preprocessed news to be matched;

step M3.2: splitting the preprocessed news to be matched into sentences and words according to a preset rule;

step M3.3: performing organization prefix matching and organization full-name matching according to the organization index;

step M3.4: filtering the matched organizations and outputting the matched organizations.

Preferably, said step M3.3 comprises:

step M3.3.1: loading a prefix file to obtain a prefix word dictionary;

step M3.3.2: iterating over the sentence subset of the news to be matched and comparing the K-word short words in each sentence with the prefix word dictionary; when a short word exists in the prefix word dictionary, performing organization full-name matching between the sentence containing the prefix word and the value list corresponding to that prefix word; when the short word does not exist in the prefix word dictionary, repeating step M3.3.2; when the sentence containing the prefix word matches an organization in the value list, adding the matched organization to the result list; and when it matches no organization in the value list, repeating step M3.3.2 until all of the news to be matched has been processed.
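The loop of step M3.3.2 can be sketched in Python as follows. This is only one possible reading of the step: the function name `match_organizations` is hypothetical, the index is assumed to be a plain dictionary from prefix to suffix list, and the full-name check simply tests whether a suffix follows the prefix directly in the sentence.

```python
def match_organizations(sentences, index):
    """Return the full organization names found in the given sentences.

    index maps a prefix (key) to the list of suffixes (values) completing it.
    """
    prefix_lengths = sorted({len(p) for p in index})
    results = []
    for sentence in sentences:
        for start in range(len(sentence)):
            for length in prefix_lengths:
                short_word = sentence[start:start + length]
                if len(short_word) < length:
                    break  # ran past the end of the sentence
                suffixes = index.get(short_word)
                if suffixes is None:
                    continue  # not a known prefix: try the next short word
                for suffix in suffixes:
                    # Full-name check: the suffix must directly follow the prefix.
                    if sentence.startswith(suffix, start + length):
                        name = short_word + suffix
                        if name not in results:
                            results.append(name)
    return results

# Hypothetical index and sentences for illustration.
index = {"北京": ["银行"], "上海银": ["行", "河证券"]}
found = match_organizations(["上海银行与北京银行签署协议", "今日天气晴"], index)
# found == ["上海银行", "北京银行"]
```

Because the prefix lookup is a dictionary access, most positions in a sentence are rejected in constant time, and the more expensive full-name comparison runs only against the short value list of a prefix that actually occurs.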

The invention also provides a short text matching system applied to the news field, comprising:

module M1: constructing an organization index for the organization names to be matched by using a K-word prefix tree method;

module M2: storing the organization index and the news to be matched in a preset format;

module M3: matching organizations in the news according to the news to be matched and the organization index.

Preferably, said module M1 comprises:

module M1.1: an organization name comprises N characters; the first K characters of the organization name are selected as the organization prefix, and the remaining N-K characters are used as the organization suffix;

module M1.2: constructing a prefix tree by taking the K-character prefixes as keys and the suffixes of the organization names sharing the same prefix as values;

said module M1.2 comprises: when the size of the value list of suffixes sharing the same prefix exceeds a preset value, expanding the prefix length so that the value list size of each key stays within a preset range.

Preferably, said module M2 comprises:

module M2.1: converting the K-word prefixes in the organization index into hash codes through a hash algorithm, and storing the hash codes as a prefix word dictionary;

module M2.2: encoding and storing the organizations in the value lists of the organization index.

Preferably, said module M3 comprises:

module M3.1: performing format preprocessing on news to be matched that arrives in different file formats, to obtain the preprocessed news to be matched;

module M3.2: splitting the preprocessed news to be matched into sentences and words according to a preset rule;

module M3.3: performing organization prefix matching and organization full-name matching according to the organization index;

module M3.4: filtering the matched organizations and outputting the matched organizations;

said module M3.3 comprises:

module M3.3.1: loading a prefix file to obtain a prefix word dictionary;

module M3.3.2: iterating over the sentence subset of the news to be matched and comparing the K-word short words in each sentence with the prefix word dictionary; when a short word exists in the prefix word dictionary, performing organization full-name matching between the sentence containing the prefix word and the value list corresponding to that prefix word; when the short word does not exist in the prefix word dictionary, triggering module M3.3.2 again; when the sentence containing the prefix word matches an organization in the value list, adding the matched organization to the result list; and when it matches no organization in the value list, triggering module M3.3.2 again until all of the news to be matched has been processed.

Compared with the prior art, the invention has the following beneficial effects:

1. the invention provides a method for constructing and storing a text index in a distributed manner, which improves the query efficiency;

2. the invention provides a method and a system for matching character strings, which aim to solve the technical problem of low data matching efficiency under the condition of mass data;

3. the method and the device can quickly match related mechanisms in massive news data, solve the problem of low matching efficiency of the news data, improve the query efficiency and save the storage space.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a schematic diagram of prefix tree construction;

FIG. 2 is a comparison of the efficiency of different prefix lengths;

FIG. 3 is a flow chart of news organization matching.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

Example 1

The short text matching method applied to the news field provided by the invention is shown in FIGS. 1-3. It comprises steps M1 to M3 and their sub-steps M1.1-M1.2, M2.1-M2.2, M3.1-M3.4 and M3.3.1-M3.3.2, as set out in the Disclosure of Invention above.

The short text matching system applied to the news field provided by the invention comprises the corresponding modules M1 to M3 and their sub-modules, likewise as set out above.

Example 2

Example 2 is a modification of Example 1.

1. Organization index building module

Step 1: selecting the first K characters of an organization name as its prefix, and taking the remaining N-K characters of the name as its suffix;

Step 2: constructing a prefix tree by taking the K-word prefixes as Key values and the suffixes of organizations sharing the same prefix as Value values;

The resulting structure is shown schematically in FIG. 1 (taking K = 3 as an example).

A comparison of the efficiency of different prefix lengths is shown in FIG. 2.

Step 3: for prefixes that are very common, that is, prefixes whose suffix Value list is too large (such as Shanghai, Beijing and the like), expanding the prefix length so that the Value list size of each Key stays within a user-defined range.

2. Data storage module

Step 1: for the constructed organization prefix tree, converting the K-character prefixes into HashCode values through a hash algorithm;

Step 2: building a code mapping for the organizations in the Value lists, so that string types are converted into numerical types, which reduces storage space and speeds up queries;

Step 3: storing the prefix words as a file on hard disk, and storing the converted organization index in a Redis cluster.
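The three storage steps can be sketched in Python as below. The text does not name the hash algorithm, the numeric coding scheme or the Redis key layout, so CRC32, sequential integer codes and the `org:prefix:` / `org:name:` key names are all assumptions. The `client` argument is any object exposing `sadd(name, *values)` and `set(name, value)`, which is the shape of the redis-py client methods; no server connection is made in the sketch itself.

```python
import zlib

def prefix_hash(prefix):
    """Stable integer code for a prefix (CRC32 is an assumed choice)."""
    return zlib.crc32(prefix.encode("utf-8"))

def encode_index(index):
    """Turn {prefix: [suffix, ...]} into hashed keys and integer organization codes."""
    code_to_name = {}
    encoded = {}
    for prefix, suffixes in index.items():
        codes = []
        for suffix in suffixes:
            code = len(code_to_name) + 1  # sequential numeric code (assumed scheme)
            code_to_name[code] = prefix + suffix
            codes.append(code)
        encoded[prefix_hash(prefix)] = codes
    return encoded, code_to_name

def store(encoded, code_to_name, client, prefix_path, prefixes):
    """Write the prefix words to a file and the encoded index to a Redis-like client."""
    with open(prefix_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(sorted(prefixes)))
    for key, codes in encoded.items():
        client.sadd(f"org:prefix:{key}", *codes)   # hypothetical key layout
    for code, name in code_to_name.items():
        client.set(f"org:name:{code}", name)       # hypothetical key layout
```

A production system would also need to decide how to handle two prefixes that hash to the same code, which this sketch ignores.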

3. News organization matching module

3.1 Input module

This module acquires the news to be matched. It can accommodate various input modes, such as pasting news text, reading from a database, receiving from a message queue, or reading from a file path.

3.2 News preprocessing module

This module standardizes the news acquired from the input module.

Step 1: if the news is in a file format such as PDF, Word or HTML, first converting the file to obtain its text content; if the news is already plain text, proceeding directly to step 2;

Step 2: processing the punctuation in the text uniformly, converting it into a uniform identifier, and removing characters that are not Chinese characters, English letters or Arabic numerals;

Step 3: outputting the formatted news text.
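Step 2 of the preprocessing module might look like the following Python sketch. The separator `|`, the set of punctuation marks treated as boundaries, and the use of the Unicode range U+4E00 to U+9FFF as "Chinese characters" are assumptions; the text only says that punctuation becomes a uniform identifier and that other characters are removed. File conversion (step 1) is out of scope here.

```python
import re

SEP = "|"  # uniform identifier replacing every punctuation mark (assumed)
PUNCTUATION = "。！？；，：、!?;,:\n"  # marks treated as boundaries (assumed set)
NOT_ALLOWED = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9|]")

def normalize(text):
    """Unify punctuation into SEP, then drop every character that is not
    a Chinese character, an English letter, a digit or SEP."""
    unified = "".join(SEP if ch in PUNCTUATION else ch for ch in text)
    cleaned = NOT_ALLOWED.sub("", unified)
    # Collapse runs of separators and trim them from both ends.
    return re.sub(r"\|+", SEP, cleaned).strip(SEP)

print(normalize("上海银行（SH）发布2020年报！详情见：官网。"))
# prints: 上海银行SH发布2020年报|详情见|官网
```

Note that a literal reading of the rule also removes spaces, so English phrases are run together; whether that is desirable depends on the organization names being indexed.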

3.3 Text splitting module

Step 1: splitting the text into a subset of news sentences according to punctuation;

Step 2: according to the organization prefix length, splitting each sentence into K-character short words, which are passed to the organization matching module.
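A minimal Python sketch of the two splitting steps follows. It assumes that punctuation has already been unified into a single separator character (here `|`) and that "K-character short words" means every run of K consecutive characters, that is, a sliding window; the text does not say whether the windows overlap, so this is an interpretation. Both function names are hypothetical.

```python
def split_sentences(normalized, sep="|"):
    """Split normalized text into non-empty sentences at the separator."""
    return [s for s in normalized.split(sep) if s]

def k_char_words(sentence, k):
    """Every run of k consecutive characters, paired with its start offset."""
    return [(i, sentence[i:i + k]) for i in range(len(sentence) - k + 1)]

sentences = split_sentences("上海银行发布年报|详情见官网")
words = k_char_words(sentences[0], 3)
# words[0] == (0, "上海银"), words[1] == (1, "海银行")
```

Keeping the start offset with each short word lets the matching stage resume the full-name comparison at the right position in the sentence.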

3.4 Organization matching module (as shown in FIG. 3)

Step 1: loading the prefix file to obtain the prefix word dictionary;

Step 2: iterating over the sentence subset and comparing the K-word short words in each sentence with the prefix word dictionary; if a short word exists in the prefix word dictionary, proceeding to step 3; otherwise continuing with step 2;

Step 3: performing organization full-name matching between the sentence Sen1 containing the prefix word and the Value1 list corresponding to the prefix word Key1, using the KMP algorithm to speed up matching; if an organization [Org1, Org2, ...] in the Value1 list is matched in Sen1, adding the match to the result list; otherwise returning to step 2.
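For reference, step 3 can be illustrated with a textbook Knuth-Morris-Pratt search in Python. The KMP routines below are the standard algorithm; how the patent combines it with the Value list is not spelled out, so `match_full_names`, which searches the sentence for each full name `Key1 + suffix`, is an assumed arrangement.

```python
def failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    j = 0
    for i in range(1, len(pattern)):
        while j and pattern[i] != pattern[j]:
            j = table[j - 1]
        if pattern[i] == pattern[j]:
            j += 1
        table[i] = j
    return table

def kmp_find(text, pattern):
    """Index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    table = failure_table(pattern)
    j = 0
    for i, ch in enumerate(text):
        while j and ch != pattern[j]:
            j = table[j - 1]  # fall back without re-reading text
        if ch == pattern[j]:
            j += 1
            if j == len(pattern):
                return i - j + 1
    return -1

def match_full_names(sentence, key, value_list):
    """Full names (key + suffix) from the value list that occur in the sentence."""
    return [key + s for s in value_list if kmp_find(sentence, key + s) != -1]

print(match_full_names("上海银河证券发布公告", "上海银", ["行", "河证券"]))
# prints: ['上海银河证券']
```

KMP scans the sentence once per pattern without backtracking in the text, so each full-name check is linear in the combined length of the sentence and the name.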

3.5 Output module

Loading stop words and stop organizations, filtering the result list output by the organization matching module, and outputting the final organization matching result.

Those skilled in the art will appreciate that, in addition to implementing the system, apparatus and modules provided by the invention purely as computer-readable program code, the method steps can be logically programmed so that the same functions are implemented in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, apparatus and modules provided by the invention can be regarded as a hardware component, and the modules they include for implementing various functions can also be regarded as structures within the hardware component; modules for performing various functions can likewise be regarded both as software programs implementing the method and as structures within the hardware component.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A short text matching method applied to the news field is characterized by comprising the following steps:

2. The short text matching method applied to the news domain as set forth in claim 1, wherein the step M1 comprises:

3. The short text matching method applied to the news domain as set forth in claim 2, wherein the step M1.2 comprises: when the size of the value list of suffixes sharing the same prefix exceeds a preset value, expanding the prefix length so that the value list size of each key stays within a preset range.

4. The short text matching method applied to the news domain as set forth in claim 1, wherein the step M2 comprises:

5. The short text matching method applied to the news domain as set forth in claim 1, wherein the step M3 comprises:

6. The short text matching method applied to the news domain as set forth in claim 5, wherein the step M3.3 comprises:

step M3.3.1: loading a prefix file to obtain a prefix word dictionary;

7. A short text matching system applied to the news field is characterized by comprising:

8. The short text matching system applied to the news domain as set forth in claim 7, wherein the module M1 comprises:

9. The short text matching system applied to the news domain as set forth in claim 7, wherein the module M2 comprises:

10. The short text matching system applied to the news domain as set forth in claim 7, wherein the module M3 comprises:

said module M3.3 comprises:

module M3.3.1: loading a prefix file to obtain a prefix word dictionary;