CN111581459B - Character string matching method and character string matching system - Google Patents

Character string matching method and character string matching system

Info

Publication number
CN111581459B
Authority
CN
China
Prior art keywords
matched
character
character string
boundary
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010538767.5A
Other languages
Chinese (zh)
Other versions
CN111581459A (en)
Inventor
杨嘉佳
唐球
徐睿
刘金
张雷
吴云峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
6th Research Institute of China Electronics Corp
Original Assignee
6th Research Institute of China Electronics Corp
Application filed by 6th Research Institute of China Electronics Corp
Priority to CN202010538767.5A
Publication of CN111581459A
Application granted
Publication of CN111581459B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a character string matching method and a character string matching system. A text to be matched is divided into a plurality of segments of character strings to be matched. To avoid missing matches at the boundaries of those segments, at least one boundary character is extracted from each of the mutually adjacent sides of any two adjacent segments, yielding a plurality of segments of boundary character strings to be matched. During matching, a target character string that matches a reference character string is determined from the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched. This ensures that every character of the text is covered during matching, improves matching efficiency, reduces matching time, and improves matching performance.

Description

Character string matching method and character string matching system
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a character string matching method and a character string matching system.
Background
Hot-topic report detection technology finds and summarizes important information and content from social media, detects hot topics in web text reports, and tracks the evolution of those topics in real time.
Character string matching is a key technology in hot-topic report detection. Character string matching is performed each time hot topics are detected: when a user searches by keyword, the search engine searches social media for important information and content and, if it finds information consistent with what the user requires, returns it to the user to review and select.
As social media continues to develop, the volume of web text reports keeps growing. Consequently, when hot topics are detected, the length of the character strings to be matched increases exponentially, the time consumed by matching keeps rising, and matching efficiency keeps falling.
Disclosure of Invention
In view of the above, an object of the present application is to provide a character string matching method and a character string matching system that divide a text to be matched into a plurality of segments of character strings to be matched and a plurality of segments of boundary character strings to be matched, and match the reference character string against each of them. This ensures that every character is covered during matching, improves matching efficiency, reduces matching time, and improves matching performance.
In a first aspect, the present application provides a character string matching method, including:
acquiring a text to be matched and a reference character string aiming at the text to be matched;
determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string;
respectively extracting at least one boundary character from the mutually adjacent sides of any two adjacent sections of character strings to be matched, and determining a plurality of sections of boundary character strings to be matched, wherein each section of boundary character string to be matched comprises a plurality of boundary characters extracted from the adjacent two sections of character strings to be matched, and the character length of each section of boundary character string to be matched is greater than or equal to that of the reference character string;
and determining a target character string matched with the reference character string from the plurality of sections of character strings to be matched and the plurality of sections of boundary character strings to be matched.
Preferably, after the target character string matched with the reference character string is determined from the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched, the character string matching method further includes:
and counting the number of the target character strings matched with the reference character strings.
Preferably, the plurality of segments of the character strings to be matched are determined by:
acquiring the character length of the reference character string;
determining the division step length of the text to be matched based on the character length of the reference character string;
and based on the division step length, dividing the character strings of the text to be matched by taking the first character of the text to be matched as a starting point, and determining a plurality of sections of character strings to be matched.
Preferably, the boundary string to be matched is determined by:
determining the character length of the boundary character string to be matched;
extracting boundary characters from two adjacent segments of character strings to be matched based on the character length of the boundary character strings to be matched;
and determining the extracted boundary character as a boundary character string to be matched.
Preferably, the character length of the boundary character string to be matched is determined by the following steps:
determining the character length of boundary characters extracted from the mutually adjacent sides of any two adjacent character strings to be matched based on the character length of the reference character string;
and determining the character length of the boundary character string to be matched based on the character length of the boundary character.
Preferably, the character length of the boundary character string to be matched is determined by the following formula:
M=2×(m-1);
wherein M represents the character length of the boundary character string to be matched, m represents the character length of the reference character string, and m-1 represents the character length of the boundary characters extracted from each side.
Preferably, the determining, from the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched, the target character string matched with the reference character string includes:
determining initial characters of the character string to be matched and the boundary character string to be matched;
and respectively searching a target character string which is the same as the reference character string from each section of character string to be matched and each section of boundary character string to be matched by taking the determined initial character as a starting point and the character length of the reference character string as a matching step length.
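The search step above can be read in more than one way. The sketch below takes one reading, in which the "matching step length" is the width of the window compared against the reference character string at each candidate position. The function name `find_matches` and the `start` parameter are hypothetical and not taken from the application.

```python
def find_matches(piece: str, reference: str, start: int = 0) -> list[int]:
    """Return every index >= start where a window of len(reference) equals the reference."""
    m = len(reference)
    # Slide a window of width m over the piece and compare it with the reference.
    return [i for i in range(start, len(piece) - m + 1) if piece[i:i + m] == reference]


print(find_matches("bcabca", "bca"))  # [0, 3]
```

The same function could be applied unchanged to an ordinary segment or to a boundary character string, since both are plain character sequences.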
In a second aspect, the present application provides a string matching system, comprising:
the acquisition module is used for acquiring a text to be matched and a reference character string aiming at the text to be matched;
the first determining module is used for determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string;
the second determining module is used for respectively extracting at least one boundary character from the mutually adjacent sides of any two adjacent sections of character strings to be matched and determining a plurality of sections of boundary character strings to be matched, wherein each section of boundary character string to be matched comprises a plurality of boundary characters extracted from the two adjacent sections of character strings to be matched, and the character length of each section of boundary character string to be matched is greater than or equal to the character length of the reference character string;
and the third determining module is used for determining a target character string matched with the reference character string from the multiple segments of character strings to be matched and the multiple segments of boundary character strings to be matched.
Preferably, after the third determining module is configured to determine a target character string matching the reference character string from the multiple segments of character strings to be matched and the multiple segments of boundary character strings to be matched, the character string matching system further includes:
and the counting module is used for counting the number of the target character strings matched with the reference character strings.
Preferably, the first determining module is configured to determine a plurality of segments of character strings to be matched by:
acquiring the character length of the reference character string;
determining the division step length of the text to be matched based on the character length of the reference character string;
and based on the division step length, dividing the character strings of the text to be matched by taking the first character of the text to be matched as a starting point, and determining a plurality of sections of character strings to be matched.
Preferably, the second determining module is configured to determine the boundary string to be matched by:
determining the character length of the boundary character string to be matched;
extracting boundary characters from two adjacent segments of character strings to be matched based on the character length of the boundary character strings to be matched;
and determining the extracted boundary character as a boundary character string to be matched.
Preferably, the second determining module is further configured to determine the character length of the boundary character string to be matched by:
determining the character length of boundary characters extracted from the mutually adjacent sides of any two adjacent character strings to be matched based on the character length of the reference character string;
and determining the character length of the boundary character string to be matched based on the character length of the boundary character.
Preferably, the second determining module is configured to determine the character length of the boundary character string to be matched by the following formula:
M=2×(m-1);
wherein M represents the character length of the boundary character string to be matched, m represents the character length of the reference character string, and m-1 represents the character length of the boundary characters extracted from each side.
Preferably, when the third determining module is configured to determine the target character string matched with the reference character string from the multiple segments of the character strings to be matched and the multiple segments of the boundary character strings to be matched, the third determining module is specifically configured to:
determining initial characters of the character string to be matched and the boundary character string to be matched;
and respectively searching a target character string which is the same as the reference character string from each section of character string to be matched and each section of boundary character string to be matched by taking the determined initial character as a starting point and the character length of the reference character string as a matching step length.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the string matching method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the string matching method according to the first aspect.
The embodiments of the present application provide a character string matching method and a character string matching system. When character string matching is carried out, the text to be matched is first divided into a plurality of segments of character strings to be matched. To avoid missing matches at the boundaries of those segments, at least one boundary character is extracted from each of the mutually adjacent sides of any two adjacent segments, yielding a plurality of segments of boundary character strings to be matched. A target character string matched with the reference character string is then determined from the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched. Because both kinds of segments are matched against the reference character string, every character is covered during matching, matching efficiency is improved, matching time is reduced, and matching performance is improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a flowchart of a string matching method according to an embodiment of the present application;
Fig. 2 is a flowchart of another string matching method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a string matching system according to an embodiment of the present application;
Fig. 4 is a second schematic structural diagram of a string matching system according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
Character string matching is a key technology in hot-topic report detection. Character string matching is performed each time hot topics are detected: when a user searches by keyword, the search engine searches social media for important information and content and, if it finds information consistent with what the user requires, returns it to the user to review and select. As social media continues to develop, the volume of web text reports keeps growing, so that the length of the character strings to be matched increases exponentially when hot topics are detected, the time consumed by matching keeps rising, and matching efficiency falls. Based on this, the embodiments of the present application provide a character string matching method and a character string matching system that perform character string matching with a parallel matching approach, thereby improving matching efficiency to a certain extent.
Referring to Fig. 1, which is a flowchart of a character string matching method provided in an embodiment of the present application, the character string matching method includes:
S110, obtaining a text to be matched and a reference character string for the text to be matched.
In this step, the text to be matched may come from web text on social media, from posts in fields such as economics, science and technology, and sports, from microblog posts, and the like. The reference character string is a keyword that occurs in such web texts and posts. In the embodiment of the present application, neither the number nor the character length of the reference character strings is specifically limited.
Specifically, the text to be matched is represented as an array.
S120, determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string.
In this step, the text to be matched is divided into a plurality of segments of character strings to be matched, and the character length of each segment is greater than or equal to that of the reference character string. A segment is therefore long enough to contain the reference character string, so that character strings matching the reference character string can be found within it and are not omitted.
S130, respectively extracting at least one boundary character from the mutually adjacent sides of any two adjacent sections of character strings to be matched to obtain a plurality of sections of boundary character strings to be matched, wherein each section of boundary character string to be matched comprises a plurality of boundary characters extracted from the adjacent two sections of character strings to be matched, and the character length of each section of boundary character string to be matched is greater than or equal to the character length of the reference character string.
It should be noted that, when the text to be matched is divided into a plurality of segments of character strings to be matched, each segment has boundary characters. Because dividing the text can split a complete character string across two segments, occurrences can be missed during matching, producing an erroneous matching result.
Therefore, in this step, at least one boundary character is extracted from each of the mutually adjacent sides of two adjacent segments of character strings to be matched to form one boundary character string to be matched. Doing so for every pair of adjacent segments yields a plurality of segments of boundary character strings to be matched.
To ensure that the reference character string can be effectively searched for in each boundary character string to be matched, the character length of each boundary character string to be matched is greater than or equal to the character length of the reference character string.
Thus, when the reference character string is matched, the character strings matching it can be found and omissions are avoided. Here, "at least one" covers one, two, or more; the number is determined based on the character length of the reference character string.
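A small sketch may make the omission problem concrete. The sample text and keyword are invented for illustration. The segment length is assumed to equal the keyword length, and m-1 characters are taken from each side of the cut, where m is the keyword length.

```python
text = "xxabcxx"      # hypothetical text to be matched
reference = "abc"     # hypothetical reference character string
m = len(reference)

# Cut the text into consecutive segments of length m.
segments = [text[i:i + m] for i in range(0, len(text), m)]
print(segments)                                # ['xxa', 'bcx', 'x']
print(any(reference in s for s in segments))   # False: the match was split by the cut

# Join the last m-1 characters before the first cut with the first m-1 after it.
cut = m
boundary = text[cut - (m - 1):cut] + text[cut:cut + (m - 1)]
print(boundary)                # xabc
print(reference in boundary)   # True: the boundary string recovers the match
```

The occurrence of "abc" straddles the first cut, so no single segment contains it, while the boundary character string built around that cut does.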
S140, determining a target character string matched with the reference character string from the multiple sections of character strings to be matched and the multiple sections of boundary character strings to be matched.
In this step, character strings identical to the reference character string are searched for simultaneously in the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched, and each such character string is determined as a target character string. Searching in parallel increases the speed of character string matching. Furthermore, in this step, matching can be carried out concurrently using a basic character string matching algorithm together with multithreading or similar techniques, and parallel pipeline processing greatly improves the processing efficiency for long texts to be matched.
The basic string matching algorithm adopted in the embodiment of the present application is not limited herein, and common basic string matching algorithms include a KMP algorithm, a BM algorithm, a finite automata algorithm, and the like.
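As a rough illustration of matching many pieces concurrently, the sketch below hands segments and boundary character strings to a thread pool. It uses Python's built-in substring search as a stand-in for whichever basic algorithm is chosen. The function names and the worker count are assumptions, not part of the application.

```python
from concurrent.futures import ThreadPoolExecutor


def contains(piece: str, reference: str) -> bool:
    # Stand-in for any basic matcher (KMP, BM, finite automaton, ...).
    return reference in piece


def parallel_hits(pieces: list[str], reference: str, workers: int = 4) -> list[bool]:
    """Match every piece against the reference using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input pieces.
        return list(pool.map(lambda p: contains(p, reference), pieces))


# Segments and boundary strings of the hypothetical text "xxabcxx" for reference "abc".
pieces = ["xxa", "bcx", "x", "xabc", "cxx"]
print(parallel_hits(pieces, "abc"))  # [False, False, False, True, False]
```

In standard CPython builds, threads share a global interpreter lock, so this sketch shows the structure of the approach rather than a guaranteed speed-up. Processes or separate machines could take the place of threads.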
Furthermore, in order to improve the matching performance of the character strings, the embodiment of the application adopts a parallel processing idea to segment the text to be matched, and then performs parallel processing to improve the matching performance of the character strings.
The embodiment of the present application provides a character string matching method. When character string matching is carried out, the text to be matched is first divided into a plurality of segments of character strings to be matched. To avoid missing matches at the boundaries of those segments, at least one boundary character is extracted from each of the mutually adjacent sides of any two adjacent segments, yielding a plurality of segments of boundary character strings to be matched. A target character string matched with the reference character string is then determined from the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched. Every character is therefore covered during matching, matching efficiency is improved, matching time is reduced, and matching performance is improved.
Referring to Fig. 2, which is a flowchart of another character string matching method provided in an embodiment of the present application, the character string matching method includes:
S210, obtaining a text to be matched and a reference character string for the text to be matched.
S220, determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string.
S230, respectively extracting at least one boundary character from the mutually adjacent sides of any two adjacent sections of character strings to be matched to obtain a plurality of sections of boundary character strings to be matched, wherein each section of boundary character string to be matched comprises a plurality of boundary characters extracted from the adjacent two sections of character strings to be matched, and the character length of each section of boundary character string to be matched is greater than or equal to the character length of the reference character string.
S240, determining a target character string matched with the reference character string from the multiple sections of character strings to be matched and the multiple sections of boundary character strings to be matched.
For the descriptions of S210 to S240, reference may be made to the descriptions of S110 to S140; the same technical effects can be achieved, and details are not repeated here.
S250, counting the number of the target character strings matched with the reference character string.
In this step, the target character strings appearing in the plurality of segments of character strings to be matched and the plurality of segments of boundary character strings to be matched are looked up, and the number of their occurrences is counted.
Furthermore, the embodiment of the present application performs character string matching based on the following idea. To obtain a correct matching result, both the segments of the text to be matched and the boundaries between those segments must be matched. A distributed parallel processing approach is therefore proposed: a long text is segmented, the segments are distributed to nodes for processing, and the number of hits at each node is counted. Segment boundaries are matched by dedicated nodes, which feed their matching results back to the main node. The total number of hits is obtained from all the matching results, thereby accelerating character string matching through parallelism.
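The counting idea can be sketched on a single machine by treating each segment and each boundary character string as the work of one "node" and summing the per-piece counts. The sketch assumes a non-empty reference character string, a segment length equal to its length m, and boundary windows of m-1 characters per side. The function names are hypothetical.

```python
def count_overlapping(piece: str, reference: str) -> int:
    """Count occurrences of reference in piece, allowing overlaps."""
    count, start = 0, piece.find(reference)
    while start != -1:
        count += 1
        start = piece.find(reference, start + 1)
    return count


def total_hits(text: str, reference: str) -> int:
    """Sum the per-piece hit counts over all segments and boundary windows."""
    m = len(reference)  # assumed to be at least 1
    segments = [text[i:i + m] for i in range(0, len(text), m)]
    boundaries = [text[max(0, c - (m - 1)):c + (m - 1)] for c in range(m, len(text), m)]
    # Each piece stands for one node's workload; the main node adds up the results.
    return sum(count_overlapping(p, reference) for p in segments + boundaries)


text = "abcabxabcab"
print(total_hits(text, "abc"))         # 2
print(count_overlapping(text, "abc"))  # 2, the same as scanning the whole text
```

Under these assumptions no occurrence is counted twice. Each side of a boundary window holds only m-1 characters, so any match found in a window must cross the cut and cannot also lie inside a single segment.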
In the embodiment of the present application, as a preferred embodiment, the multiple segments of character strings to be matched are determined through the following steps:
acquiring the character length of the reference character string;
In this step, there may be more than one reference character string. When there are multiple reference character strings, the maximum character length among them is obtained; when there is only one, its character length is obtained.
And determining the division step length of the text to be matched based on the character length of the reference character string.
In this step, the character length of the reference character string is used as a dividing step length, and the text to be matched is divided based on the dividing step length.
And based on the division step length, dividing the character strings of the text to be matched by taking the first character of the text to be matched as a starting point, and determining a plurality of sections of character strings to be matched.
In the embodiment of the present application, the one-dimensional array formed by the characters of the text to be matched is divided using a division step equal to the character length of the reference character string. For example, with a division step of 3: [ a1, a2, a3 | a4, a5, a6 | … … | an-2, an-1, an ], thereby determining a plurality of segments of character strings to be matched.
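The division can be sketched as follows. The function name `split_into_segments` and the sample text are hypothetical. How a short trailing segment should be handled is not stated in the text, and this sketch simply keeps it.

```python
def split_into_segments(text: str, step: int) -> list[str]:
    """Cut text into consecutive segments of `step` characters, starting at the first character."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [text[i:i + step] for i in range(0, len(text), step)]


reference = "abc"
segments = split_into_segments("abcabxabcab", len(reference))
print(segments)  # ['abc', 'abx', 'abc', 'ab']
```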
In the embodiment of the present application, as a preferred embodiment, the boundary character string to be matched is determined by the following steps:
determining the character length of the boundary character string to be matched;
In this step, because the boundary character string to be matched consists of a plurality of boundary characters extracted from two adjacent segments of character strings to be matched, the character length of the newly formed boundary character string is not known in advance and therefore needs to be determined.
Extracting boundary characters from two adjacent segments of character strings to be matched based on the character length of the boundary character strings to be matched;
and determining the extracted boundary character as a boundary character string to be matched.
In the step, based on the determined character length of the boundary character string to be matched, the boundary characters are extracted from two adjacent sections of the character string to be matched, so that the character length of the boundary character string to be matched, which is composed of the extracted boundary characters, is the same as the predetermined character length.
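The extraction step can be sketched as follows, under the assumption that the same number of boundary characters is taken from each side of a boundary. The function name `boundary_strings` and the parameter `side_len` (characters taken per side) are hypothetical names introduced only for this illustration.

```python
def boundary_strings(segments, side_len):
    """Build one boundary string per pair of adjacent segments.

    Each boundary string joins the last `side_len` characters of the left
    segment with the first `side_len` characters of the right segment.
    Assumption: if a segment is shorter than `side_len`, all of it is used.
    """
    if side_len <= 0:
        return []  # nothing to extract (also avoids left[-0:] returning the whole segment)
    return [
        left[-side_len:] + right[:side_len]
        for left, right in zip(segments, segments[1:])
    ]


if __name__ == "__main__":
    segments = ["abc", "def", "gh"]
    print(boundary_strings(segments, 2))  # ['bcde', 'efgh']
```

Because the end of one segment is immediately followed by the start of the next in the original text, each boundary string is itself a contiguous piece of the text to be matched.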
In the embodiment of the present application, as a preferred embodiment, the character length of the boundary character string to be matched is determined by the following steps:
determining the character length of boundary characters extracted from the mutually adjacent sides of any two adjacent character strings to be matched based on the character length of the reference character string;
and determining the character length of the boundary character string to be matched based on the character length of the boundary character.
In this step, the character length of the boundary character string to be matched is the sum of the character lengths of the boundary characters extracted from the mutually adjacent sides of the two adjacent segments of character strings to be matched.
Specifically, the character length of the boundary character string to be matched is determined by the following formula:
M=2×(m-1);
wherein M represents the character length of the boundary character string to be matched, m represents the character length of the reference character string, and m-1 represents the character length of the boundary characters extracted from each side.
According to the formula, m-1 boundary characters are extracted from each of the mutually adjacent sides of any two adjacent character strings to be matched, so the character length of the boundary character string to be matched is 2×(m-1), where m represents the character length of the reference character string.
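A short justification of why m-1 characters per side are enough can be written out as below. It is an illustrative argument rather than text from the application, and it assumes that every segment is at least m characters long, so that an occurrence of the reference character string can straddle at most one boundary. The symbol k (the number of characters of the occurrence that lie in the left segment) is introduced only for this argument.

```latex
\begin{aligned}
&\text{straddling occurrence: } 1 \le k \le m-1
  \;\Longrightarrow\; 1 \le m-k \le m-1,\\
&\text{so each side contributes at most } m-1 \text{ characters, and}\\
&M = (m-1) + (m-1) = 2\times(m-1),\\
&\text{e.g. } m = 3:\quad M = 2\times(3-1) = 4 .
\end{aligned}
```

Note that M ≥ m holds only when m ≥ 2. For a reference character string of a single character, M = 0 and no occurrence can straddle a boundary.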
In the embodiment of the present application, as a preferred embodiment, step S240 includes:
determining initial characters of the character string to be matched and the boundary character string to be matched;
and respectively searching a target character string which is the same as the reference character string from each section of character string to be matched and each section of boundary character string to be matched by taking the determined initial character as a starting point and the character length of the reference character string as a matching step length.
In this step, each character string to be matched is a one-dimensional array. With the determined initial character as the starting point, the end of the character string as the end point, and the character length of the reference character string as the matching step length, a target character string identical to the reference character string is searched for in each segment of the character strings to be matched. The boundary character strings to be matched are searched in the same way.
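One possible reading of this search is sketched below. It interprets the "matching step length" as the width of the comparison window and moves that window one character at a time from the initial character to the end of the string. This interpretation is an assumption, and the function name `find_matches` is hypothetical.

```python
def find_matches(candidate, reference):
    """Return the start offsets in `candidate` where `reference` occurs.

    The comparison window is len(reference) characters wide and advances
    one character at a time, so overlapping occurrences are all reported.
    """
    m = len(reference)
    if m == 0:
        return []
    return [
        i
        for i in range(len(candidate) - m + 1)  # empty range if candidate is too short
        if candidate[i:i + m] == reference
    ]


if __name__ == "__main__":
    print(find_matches("bcde", "cd"))  # [1]
```

The same function applies unchanged to an ordinary segment and to a boundary character string, which matches the statement that both are searched in the same way.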
The embodiment of the present application provides a character string matching method. When character string matching is performed, the text to be matched is first divided into multiple segments of character strings to be matched, where the character length of each character string to be matched is greater than or equal to the character length of the reference character string. To avoid missing the boundary characters of each segment, at least one boundary character is extracted from each of the mutually adjacent sides of any two adjacent segments of character strings to be matched, so as to obtain multiple segments of boundary character strings to be matched. Each segment of boundary character string to be matched includes a plurality of boundary characters extracted from the two adjacent segments of character strings to be matched, and its character length is greater than or equal to the character length of the reference character string. A target character string matched with the reference character string is then determined from the multiple segments of character strings to be matched and the multiple segments of boundary character strings to be matched, and finally the number of target character strings matched with the reference character string is counted. In this way, no characters are missed during matching, the character string matching efficiency is improved, the time consumed by character string matching is reduced, and the character string matching performance is improved.
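The whole method can be sketched end to end as below, for a single reference character string. This is an illustrative sketch under stated assumptions, not the applicant's implementation: the division step length is taken to be exactly the reference length m, m-1 characters are taken from each side of every boundary, and the search slides a window of width m one character at a time. The names `count_matches` and `naive_count` are hypothetical; `naive_count` is only a plain full-text scan used to cross-check the result.

```python
def count_matches(text, reference):
    """Count occurrences of `reference` in `text` via segments plus boundary strings."""
    m = len(reference)
    if m == 0:
        return 0

    # Step 1: divide the text into segments of m characters, starting at index 0.
    segments = [text[i:i + m] for i in range(0, len(text), m)]

    # Step 2: one boundary string of up to 2*(m-1) characters per adjacent pair.
    boundaries = []
    if m > 1:
        boundaries = [
            left[-(m - 1):] + right[:m - 1]
            for left, right in zip(segments, segments[1:])
        ]

    # Step 3: search every segment and every boundary string the same way.
    def occurrences(candidate):
        return sum(
            1
            for i in range(len(candidate) - m + 1)
            if candidate[i:i + m] == reference
        )

    # Step 4: count the target strings found.
    return sum(occurrences(s) for s in segments + boundaries)


def naive_count(text, reference):
    """Reference check: scan the whole text one character at a time."""
    m = len(reference)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == reference)


if __name__ == "__main__":
    text, ref = "xabcxxabc", "abc"
    print(count_matches(text, ref), naive_count(text, ref))  # 2 2
```

In this example the occurrence at offset 1 straddles the first boundary and is found only in the boundary string "abcx", while the occurrence at offset 6 coincides with the third segment. Because each side of a boundary string holds only m-1 characters, a match inside a boundary string always crosses the boundary, so no occurrence is counted twice.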
Based on the same inventive concept, the embodiments of the present application further provide a character string matching system corresponding to the character string matching method. Because the system solves the problem on a principle similar to that of the character string matching method, the implementation of the system may refer to the implementation of the method, and repeated descriptions are omitted.
Referring to fig. 3 and 4, fig. 3 is a first schematic structural diagram of a string matching system according to an embodiment of the present application, and fig. 4 is a second schematic structural diagram of a string matching system according to an embodiment of the present application. As shown in fig. 3, the character string matching system 300 includes:
an obtaining module 310, configured to obtain a text to be matched and a reference character string for the text to be matched;
a first determining module 320, configured to determine multiple segments of character strings to be matched from the text to be matched, where a character length of the character string to be matched is greater than or equal to a character length of the reference character string;
the second determining module 330 is configured to extract at least one boundary character from each of mutually adjacent sides of any two adjacent segments of to-be-matched character strings, and determine a plurality of segments of to-be-matched boundary character strings, where each segment of to-be-matched boundary character string includes a plurality of boundary characters extracted from two adjacent segments of to-be-matched character strings, and a character length of each segment of to-be-matched boundary character string is greater than or equal to a character length of the reference character string;
and the third determining module 340 is configured to determine a target character string matched with the reference character string from the multiple segments of the character strings to be matched and the multiple segments of the boundary character strings to be matched.
Further, as shown in fig. 4, after the third determining module 340 is configured to determine a target character string matching the reference character string from the multiple segments of character strings to be matched and the multiple segments of boundary character strings to be matched, the character string matching system 300 further includes:
and a counting module 350, configured to count the number of the target character strings matched with the reference character string.
In this embodiment of the application, the first determining module 320 is configured to determine a plurality of segments of character strings to be matched by:
acquiring the character length of the reference character string;
determining the division step length of the text to be matched based on the character length of the reference character string;
and based on the division step length, dividing the character strings of the text to be matched by taking the first character of the text to be matched as a starting point, and determining a plurality of sections of character strings to be matched.
In this embodiment, as a preferred embodiment, the second determining module 330 is configured to determine the boundary character string to be matched through the following steps:
determining the character length of the boundary character string to be matched;
extracting boundary characters from two adjacent segments of character strings to be matched based on the character length of the boundary character strings to be matched;
and determining the extracted boundary character as a boundary character string to be matched.
In this embodiment of the application, the second determining module 330 is further configured to determine the character length of the boundary character string to be matched by:
determining the character length of boundary characters extracted from the mutually adjacent sides of any two adjacent character strings to be matched based on the character length of the reference character string;
and determining the character length of the boundary character string to be matched based on the character length of the boundary character.
In this embodiment, as a preferred embodiment, the second determining module 330 is configured to determine the character length of the boundary character string to be matched by the following formula:
M=2×(m-1);
wherein M represents the character length of the boundary character string to be matched, m represents the character length of the reference character string, and m-1 represents the character length of the boundary characters extracted from each side.
In this embodiment of the application, when the third determining module 340 is configured to determine, from the multiple segments of the character strings to be matched and the multiple segments of the boundary character strings to be matched, a target character string that matches the reference character string, the third determining module 340 is specifically configured to:
determining initial characters of the character string to be matched and the boundary character string to be matched;
and respectively searching a target character string which is the same as the reference character string from each section of character string to be matched and each section of boundary character string to be matched by taking the determined initial character as a starting point and the character length of the reference character string as a matching step length.
The embodiment of the application provides a character string matching system, which comprises an acquisition module, a first determination module, a second determination module and a third determination module; the acquisition module is used for acquiring a text to be matched and a reference character string aiming at the text to be matched; the first determining module is used for determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string; the second determining module is used for respectively extracting at least one boundary character from the mutually adjacent sides of any two adjacent sections of character strings to be matched and determining a plurality of sections of boundary character strings to be matched, wherein each section of boundary character string to be matched comprises a plurality of boundary characters extracted from the two adjacent sections of character strings to be matched, and the character length of each section of boundary character string to be matched is greater than or equal to that of the reference character string; and the third determining module is used for determining a target character string matched with the reference character string from the multiple segments of character strings to be matched and the multiple segments of boundary character strings to be matched.
Therefore, when the character strings are matched, the integrity of all the matched characters in the matching process can be ensured, the character string matching efficiency is effectively improved, the consumption of the character string matching time is greatly reduced, and the character string matching performance is improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.
The memory 520 stores machine-readable instructions executable by the processor 510. When the electronic device 500 runs, the processor 510 communicates with the memory 520 through the bus 530, and when the machine-readable instructions are executed by the processor 510, the steps of the character string matching method shown in fig. 1 or fig. 2 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the character string matching method described in fig. 1 or fig. 2 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A character string matching method, characterized in that the character string matching method comprises:
acquiring a text to be matched and a reference character string aiming at the text to be matched;
determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string;
determining a plurality of segments of character strings to be matched by the following steps: acquiring the character length of the reference character string; determining the division step length of the text to be matched based on the character length of the reference character string; based on the division step length, with the first character of the text to be matched as a starting point, dividing character strings of the text to be matched, and determining a plurality of sections of character strings to be matched;
respectively extracting at least one boundary character from the mutually adjacent sides of any two adjacent sections of character strings to be matched, and determining a plurality of sections of boundary character strings to be matched; each segment of boundary character string to be matched comprises a plurality of boundary characters extracted from two adjacent segments of character strings to be matched, and the character length of the boundary character string to be matched is the sum of the character lengths of the boundary characters extracted from the adjacent sides of the two adjacent segments of character strings to be matched;
determining the character length of the boundary character string to be matched by the following formula:
M=2×(m-1);
wherein M represents the character length of the boundary character string to be matched, m represents the character length of the reference character string, and m-1 represents the character length of the boundary character;
and determining a target character string matched with the reference character string from the plurality of sections of character strings to be matched and the plurality of sections of boundary character strings to be matched.
2. The character string matching method according to claim 1, wherein after determining a target character string that matches the reference character string from among the plurality of pieces of character strings to be matched and the plurality of pieces of boundary character strings to be matched, the character string matching method further comprises:
and counting the number of the target character strings matched with the reference character strings.
3. The character string matching method according to claim 1, wherein the boundary character string to be matched is determined by:
determining the character length of the boundary character string to be matched;
extracting boundary characters from two adjacent segments of character strings to be matched based on the character length of the boundary character strings to be matched;
and determining the extracted boundary character as a boundary character string to be matched.
4. The character string matching method according to claim 1, wherein the determining of the target character string matching the reference character string from the plurality of segments of the character string to be matched and the plurality of segments of the boundary character string to be matched comprises:
determining initial characters of the character string to be matched and the boundary character string to be matched;
and respectively searching a target character string which is the same as the reference character string from each section of character string to be matched and each section of boundary character string to be matched by taking the determined initial character as a starting point and the character length of the reference character string as a matching step length.
5. A string matching system, characterized in that the string matching system comprises:
the acquisition module is used for acquiring a text to be matched and a reference character string aiming at the text to be matched;
the first determining module is used for determining a plurality of sections of character strings to be matched from the text to be matched, wherein the character length of the character strings to be matched is greater than or equal to the character length of the reference character string;
the first determining module is used for determining a plurality of segments of character strings to be matched through the following steps: acquiring the character length of the reference character string; determining the division step length of the text to be matched based on the character length of the reference character string; based on the division step length, with the first character of the text to be matched as a starting point, dividing character strings of the text to be matched, and determining a plurality of sections of character strings to be matched;
the second determining module is used for respectively extracting at least one boundary character from the mutually adjacent side of any two adjacent sections of character strings to be matched and determining a plurality of sections of boundary character strings to be matched, wherein each section of boundary character string to be matched comprises a plurality of boundary characters extracted from the two adjacent sections of character strings to be matched, and the character length of the boundary character string to be matched is the sum of the character lengths of the boundary characters extracted from the mutually adjacent sides of the two adjacent sections of character strings to be matched;
the second determining module is used for determining the character length of the boundary character string to be matched through the following formula:
M=2×(m-1);
wherein M represents the character length of the boundary character string to be matched, m represents the character length of the reference character string, and m-1 represents the character length of the boundary character;
and the third determining module is used for determining a target character string matched with the reference character string from the multiple segments of character strings to be matched and the multiple segments of boundary character strings to be matched.
6. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when an electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the string matching method of any of claims 1 to 4.
7. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the string matching method according to any one of claims 1 to 4.
CN202010538767.5A 2020-06-13 2020-06-13 Character string matching method and character string matching system Active CN111581459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010538767.5A CN111581459B (en) 2020-06-13 2020-06-13 Character string matching method and character string matching system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010538767.5A CN111581459B (en) 2020-06-13 2020-06-13 Character string matching method and character string matching system

Publications (2)

Publication Number Publication Date
CN111581459A CN111581459A (en) 2020-08-25
CN111581459B true CN111581459B (en) 2021-06-15

Family

ID=72123829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010538767.5A Active CN111581459B (en) 2020-06-13 2020-06-13 Character string matching method and character string matching system

Country Status (1)

Country Link
CN (1) CN111581459B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117709298B (en) * 2024-02-05 2024-05-07 中国电子信息产业集团有限公司第六研究所 Double character stream scanning method, electronic equipment, storage medium and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101551803A (en) * 2008-03-31 2009-10-07 华为技术有限公司 Method and device for establishing pattern matching state machine and pattern recognition
CN102063510A (en) * 2011-01-17 2011-05-18 珠海全志科技有限公司 Method for searching matched character string
CN106648141A (en) * 2016-12-26 2017-05-10 北京小米移动软件有限公司 Candidate word display method and device
JP2017167882A (en) * 2016-03-17 2017-09-21 日本電気株式会社 Sentence boundary estimation device, method, and program
JP2018036787A (en) * 2016-08-30 2018-03-08 キヤノン株式会社 Information processor, display control method of character string, and program for character string edition
CN111191087A (en) * 2019-12-31 2020-05-22 歌尔股份有限公司 Character matching method, terminal device and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725749B2 (en) * 2012-07-24 2014-05-13 Hewlett-Packard Development Company, L.P. Matching regular expressions including word boundary symbols
CN110457603B (en) * 2019-08-16 2021-08-06 中国电子信息产业集团有限公司第六研究所 User relationship extraction method and device, electronic equipment and readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Importance of Aho-Corasick String Matching";Saima Hasib, Mahak Motwani, Amit Saxena;《International Journal of Computer Science and Information Technologies》;20131231;第4卷(第3期);467-469页 *
"一种基于Aho-Corasick算法改进的多模式匹配算法";陈永杰、吾守尔·斯拉木于、清;《现代电子技术》;20190215;第42卷(第4期);89-93页 *
"一种改进的AC多模式匹配算法";刘春晖、黄宇、宋琦;《计算机工程》;20151015;第41卷(第10期);280-285页 *

Also Published As

Publication number Publication date
CN111581459A (en) 2020-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant