CN112541051A - Standard text matching method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN112541051A
Authority
CN
China
Prior art keywords
standard
vector
word
text
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011257154.0A
Other languages
Chinese (zh)
Inventor
薛淼
孟格思
李敏
王瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202011257154.0A priority Critical patent/CN112541051A/en
Publication of CN112541051A publication Critical patent/CN112541051A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a standard text matching method, a standard text matching device, a storage medium and electronic equipment. A first similarity between the vector to be matched and each standard vector is calculated and used for screening to obtain at least one candidate standard vector. A second similarity is then determined from the matching word set and the standard word set corresponding to each candidate standard vector so as to determine a target standard vector, and the standard text corresponding to the target standard vector is taken as the target standard text. The embodiment of the invention thus performs a first screening by similarity in the vector dimension and a second screening by similarity in the text dimension to obtain the matched target standard text, and the two screenings improve the precision of the standard text matching process.

Description

Standard text matching method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for matching a standard text, a storage medium, and an electronic device.
Background
In the field of data processing, processing of text data includes processing methods such as correction and clustering. When the text data processing is performed, it is usually necessary to match a standard text corresponding to a text to be processed. The prior art has low accuracy when performing text matching.
Disclosure of Invention
In view of this, embodiments of the present invention provide a standard text matching method, apparatus, storage medium, and electronic device, and aim to improve accuracy of a standard text matching process.
In a first aspect, an embodiment of the present invention discloses a standard text matching method, where the method includes:
determining a text to be matched and a standard text set, wherein the standard text set comprises a plurality of standard texts;
determining a matching word set comprising all matching words in the text to be matched;
determining a standard word set comprising each standard word in each standard text;
determining vectors to be matched corresponding to the texts to be matched according to the matching word sets, and determining standard vectors corresponding to the standard texts according to the standard word sets;
determining a first similarity of the vector to be matched and each standard vector so as to determine at least one candidate standard vector;
determining corresponding second similarity according to the intersection of the standard word set corresponding to each candidate standard vector and the matching word set so as to determine a target standard vector;
and determining a standard text corresponding to the target standard vector as a target standard text corresponding to the text to be matched.
In a second aspect, an embodiment of the present invention discloses a standard text matching apparatus, where the apparatus includes:
the information determining module is used for determining a text to be matched and a standard text set, wherein the standard text set comprises a plurality of standard texts;
the first set determining module is used for determining a matching word set comprising all matching words in the text to be matched;
a second set determining module, configured to determine a standard word set including each standard word in each standard text;
the vector determining module is used for determining vectors to be matched corresponding to the texts to be matched according to the matching word sets and determining standard vectors corresponding to the standard texts according to the standard word sets;
the candidate vector determining module is used for determining the first similarity of the vector to be matched and each standard vector so as to determine at least one candidate standard vector;
the target vector determining module is used for determining corresponding second similarity according to the intersection of the standard word set corresponding to each candidate standard vector and the matching word set so as to determine a target standard vector;
and the target text determining module is used for determining that the standard text corresponding to the target standard vector is the target standard text corresponding to the text to be matched.
In a third aspect, an embodiment of the present invention discloses a computer-readable storage medium for storing computer program instructions, which when executed by a processor implement the method according to the first aspect.
In a fourth aspect, an embodiment of the present invention discloses an electronic device, which includes a memory and a processor, wherein the memory is used for storing one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method according to the first aspect.
The embodiment of the invention determines the text to be matched and the standard text set comprising a plurality of standard texts, determines the matching word set corresponding to the text to be matched and the standard word set corresponding to each standard text, and from these determines the vector to be matched and the plurality of standard vectors. A first similarity between the vector to be matched and each standard vector is calculated and used for screening to obtain at least one candidate standard vector. A second similarity is then determined from the matching word set and the standard word set corresponding to each candidate standard vector so as to determine a target standard vector, and the standard text corresponding to the target standard vector is taken as the target standard text. The embodiment of the invention thus performs a first screening by similarity in the vector dimension and a second screening by similarity in the text dimension to obtain the matched target standard text, and the two screenings improve the precision of the standard text matching process.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a standard text matching method of an embodiment of the present invention;
FIG. 2 is a diagram illustrating a word segmentation process according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a synonym replacement process according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a process of determining a first similarity according to an embodiment of the present invention;
FIG. 5 is a flowchart of a process for determining a second similarity according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a process of determining a second similarity according to an embodiment of the present invention;
FIG. 7 is a diagram of a standard text matching apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
The standard text matching method provided by the embodiment of the invention can be applied to any device, such as a terminal device or a server, on which a text processing framework implementing the method can be deployed. The terminal device may be a general-purpose data processing terminal with an acceleration sensor that is capable of running a computer program, such as a smart phone or a tablet computer. The server may be a single server or a cluster of servers configured in a distributed manner. The standard text set is the candidate matching text set of the text to be matched and comprises a plurality of standard texts; the target standard text corresponding to the text to be matched is obtained by matching among the standard texts through the standard text matching method of the embodiment of the invention. The method is described below taking as an example a server on which the corresponding text processing framework is deployed. The server determines the standard text set by receiving a plurality of standard texts sent by the terminal device or by directly acquiring a plurality of standard texts stored in a database or the like, receives the text to be matched sent by the terminal device, and runs the text processing framework deployed on the server to perform text matching with the standard text matching method of the embodiment of the invention, obtaining the target standard text.
Further, the standard text matching method of the embodiment of the invention can be applied to any application scenario that requires matching by text information, such as constructing a standardized database in a specific field, clustering text information in a specific field, or constructing a standard index in a specific field. The following takes the construction of a standardized automobile information base as an example. Standard automobile information is industry standard information in the automobile field and can include automobile brands, names, configurations and other contents. A standard automobile information set is acquired as the standard text set by means of web crawlers, manual sorting and the like, and each piece of automobile information in the current automobile information base is taken as a text to be matched. The standard text matching method of the embodiment of the invention is then used to match, in sequence, each piece of automobile information in the current automobile information base against the standard automobile information set to find its corresponding standard automobile information. Further, when the corresponding standard automobile information is the same as the current automobile information, no text replacement is performed; when it is different, the current automobile information is replaced.
Fig. 1 is a flowchart of a standard text matching method according to an embodiment of the present invention. As shown in fig. 1, the standard text matching method in the embodiment of the present invention includes the following steps:
Step S100, determining a text to be matched and a standard text set.
Specifically, the text to be matched is text data that needs to be subjected to text matching, and may be, for example, text data to be corrected in a text correction process or text data serving as search information in an information search process. The standard text set is the candidate matching text set of the text to be matched and comprises a plurality of standard texts; the standard text matching method of the embodiment of the invention matches among these standard texts to obtain the target standard text corresponding to the text to be matched. When the standard text matching method is applied to term standardization in a specific field, the text to be matched is a non-standard term of that field, and the standard text set consists of the industry standard terms of that field. Taking the field of vehicle maintenance as an example, the text to be matched can be the non-standard accessory name "goat", and the standard text set comprises industry standard accessory names in the field of automobile maintenance, such as "steering knuckle", "brake pad", "rocker arm" and "release bearing". When the standard text matching method is applied to an information search process, the text to be matched is the search information, and the standard text set is the set of information to be searched. For example, when a car is searched for in a car information base, the text to be matched may be "brand A 1 series sports type", and the standard text set is the car information base, which includes car information such as "brand A 1 series comfortable type", "brand A 2 series sports type", "brand A 1 series sports type", "brand B 3 series" and "brand C 5 series". In the embodiment of the invention, the text to be matched can be sent by another terminal device or received through an I/O interface by means of human-computer interaction. The standard text set can be sent by another terminal device or determined by directly acquiring a plurality of standard texts prestored in a database.
Step S200, determining a matching word set comprising all matching words in the text to be matched.
Specifically, after the text to be matched is determined, the words included in the text to be matched are extracted as matching words, so that the corresponding matching word set is determined. In the embodiment of the present invention, the matching words may be extracted by inputting the text to be matched into a word segmentation tool and taking each of the output segmentation results as a matching word. The word segmentation tool can be an existing tool such as Jieba, SnowNLP, pkuseg, THULAC (THU Lexical Analyzer for Chinese) or HanLP. For example, when the text to be matched is "brand A 1 series sports type", the matching word set determined after segmentation by the word segmentation tool is {"brand A", "1 series", "sports type"}.
Step S300, determining a standard word set comprising each standard word in each standard text.
Specifically, step S300 may be performed simultaneously with step S200 or separately from it. After the standard text set is determined, the words included in each standard text of the standard text set are extracted as standard words to determine the corresponding standard word set. The process of determining the standard word set corresponding to each standard text is similar to step S200 and is not described again here.
Fig. 2 is a schematic diagram of a word segmentation process according to an embodiment of the present invention. As shown in fig. 2, the matching word set and the standard word sets may be determined by segmenting the text to be matched and the standard texts with a word segmentation tool. That is, the text to be matched or the standard text 20 is input into a predetermined word segmentation tool 21 for word segmentation, and the corresponding matching word set or standard word set 22 is determined according to the obtained word segmentation result.
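Steps S200 and S300 can be sketched as follows. This is a minimal illustration, not the patented implementation: `whitespace_segmenter` is a hypothetical stand-in that merely splits on whitespace and commas, and in practice it would be replaced by one of the word segmentation tools named above. The sample texts are also illustrative.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple


def whitespace_segmenter(text: str) -> List[str]:
    """Hypothetical stand-in for a real word segmentation tool.

    It only splits on whitespace and commas; it is NOT a Chinese segmenter.
    """
    return [token for token in re.split(r"[\s,]+", text) if token]


def build_word_sets(
    text_to_match: str,
    standard_texts: Iterable[str],
    segment: Callable[[str], List[str]] = whitespace_segmenter,
) -> Tuple[List[str], Dict[str, List[str]]]:
    """Return the matching word set and one standard word set per standard text."""
    matching_words = segment(text_to_match)
    standard_words = {text: segment(text) for text in standard_texts}
    return matching_words, standard_words


matching, standard = build_word_sets(
    "brandA series1 sport",
    ["brandA series1 comfort", "brandA series2 sport"],
)
```

Passing the segmenter as a parameter keeps the same code usable for both the text to be matched and every standard text, which mirrors the single tool 21 shown in fig. 2.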
In order to improve the accuracy of the text matching result, the embodiment of the present invention may further determine a synonym library according to the standard texts, so as to preprocess the matching word set based on the synonym library. The synonym library comprises a plurality of standard words corresponding to the standard texts together with their similar words, where the similar words of a standard word are other words with the same meaning. In an embodiment of the present invention, the synonym library may be determined by a pre-trained text matching model. For example, every two standard words in the standard word set corresponding to each standard text are input into the text matching model, which outputs a corresponding matching degree, and two words whose matching degree is greater than a preset matching degree threshold are determined to be a synonym group for that standard text. The synonym library is then determined by sorting and combining the synonym groups determined from the standard texts, for example by merging and de-duplicating synonym groups that share at least one word. Optionally, the synonym library may also be compiled manually from the standard word sets corresponding to the standard texts.
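The merge-and-de-duplicate step can be sketched as below. The scored pairs, the threshold value 0.8 and the function names are assumptions made for illustration; the scores stand in for the output of the pre-trained text matching model, which is not reproduced here.

```python
from typing import Iterable, List, Set, Tuple


def merge_synonym_groups(groups: Iterable[Iterable[str]]) -> List[Set[str]]:
    """Merge groups that share at least one word; using sets removes duplicates."""
    merged: List[Set[str]] = []
    for group in groups:
        current = set(group)
        remaining = []
        for existing in merged:
            if existing & current:
                current |= existing  # absorb every overlapping group
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


def build_synonym_library(
    scored_pairs: Iterable[Tuple[str, str, float]], threshold: float
) -> List[Set[str]]:
    """Keep pairs whose matching degree exceeds the threshold, then merge them."""
    groups = [(a, b) for a, b, score in scored_pairs if score > threshold]
    return merge_synonym_groups(groups)


# Hypothetical matching degrees; a real system would obtain them from the model.
scored = [
    ("SUV", "urban cross-country", 0.93),
    ("urban cross-country", "sports utility", 0.88),
    ("sedan", "saloon", 0.91),
    ("SUV", "sedan", 0.12),
]
library = build_synonym_library(scored, threshold=0.8)
```

Here the first two pairs share "urban cross-country" and collapse into one group, while the low-scoring pair is discarded before merging.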
Further, after the synonym library is determined from the standard texts, the matching word set is preprocessed by looking up, in the synonym library, the standard word corresponding to each matching word and replacing that matching word with the standard word. Synonym replacement increases the similarity between the matching word set and the standard word sets and so improves the accuracy of the text matching process. The embodiment is illustrated with an example from the field of automobiles. The matching word set comprises the matching words {"binshi", "1 model", "urban cross-country"} obtained by segmenting the automobile information to be matched, and the synonym library comprises the synonym groups {"speed": "binshi"} and {"SUV": "urban cross-country", "sports"}. The standard word corresponding to the matching word "binshi" is determined to be "speed" and that corresponding to "urban cross-country" to be "SUV", so the matching word set after replacement is {"speed", "1 model", "SUV"}.
FIG. 3 is a diagram illustrating a synonym replacement process according to an embodiment of the present disclosure. As shown in fig. 3, after the synonym library 30 is determined based on the standard texts, it is determined whether each matching word in the matching word set 31 has a corresponding standard word in the synonym library 30; when it does, the standard word replaces the corresponding matching word to obtain a processed matching word set 32.
Specifically, when the synonym library 30 includes the synonym groups "standard word 1: A, B", "standard word 2: C", "standard word 3: D, F" and "standard word 4: E, G", and the matching word set 31 includes "A, H, G, standard word 2, L", it is determined that the standard word corresponding to the matching word "A" is "standard word 1" and the standard word corresponding to the matching word "G" is "standard word 4". The matching word set 32 obtained by replacing these matching words with the corresponding standard words is "standard word 1, H, standard word 4, standard word 2, L".
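The replacement illustrated in fig. 3 can be sketched as a dictionary lookup. The representation of the synonym library as a mapping from each standard word to its similar words, and the function names, are assumptions for illustration.

```python
from typing import Dict, List


def invert_library(library: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every similar word to the standard word it should be replaced by."""
    lookup: Dict[str, str] = {}
    for standard_word, similar_words in library.items():
        for word in similar_words:
            lookup[word] = standard_word
    return lookup


def replace_synonyms(
    matching_words: List[str], library: Dict[str, List[str]]
) -> List[str]:
    """Replace matching words found in the library; leave all others unchanged."""
    lookup = invert_library(library)
    return [lookup.get(word, word) for word in matching_words]


synonym_library = {
    "standard word 1": ["A", "B"],
    "standard word 2": ["C"],
    "standard word 3": ["D", "F"],
    "standard word 4": ["E", "G"],
}
processed = replace_synonyms(["A", "H", "G", "standard word 2", "L"], synonym_library)
# processed == ["standard word 1", "H", "standard word 4", "standard word 2", "L"]
```

Words with no entry in the library, such as "H" and "L", and words that are already standard words pass through untouched.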
Step S400, determining vectors to be matched corresponding to the texts to be matched according to the matching word sets, and determining standard vectors corresponding to the standard texts according to the standard word sets.
Specifically, after the matching word set corresponding to the text to be matched and the standard word set corresponding to each standard text are determined, vector representations of the text to be matched and of each standard text are determined from the matching word set and the standard word sets respectively. That is to say, the vector to be matched corresponding to the text to be matched is determined according to the matching word set, and the standard vector corresponding to each standard text is determined according to its standard word set. In the embodiment of the invention, the vector to be matched is determined by inputting each matching word in the matching word set into a trained word vector conversion layer, outputting the corresponding matching word vectors, and splicing them to obtain the vector to be matched. Similarly, the standard vector corresponding to each standard text is determined by inputting each standard word in its standard word set into the trained word vector conversion layer, outputting the corresponding standard word vectors, and splicing them to obtain the standard vector. In the embodiment of the present invention, the word vector conversion layer may be a word2vec word vector conversion model. word2vec is a fully-connected neural network with only one hidden layer which, once trained on a given corpus, can quickly and effectively express a word in vector form. Thus, after a matching word or a standard word is input into the word vector conversion layer, a corresponding vector representation is output.
Further, a weight may be determined for each word in the matching word set and the standard word sets according to its importance. After each matching word is converted into a corresponding matching word vector, the matching word vectors are weighted by their corresponding weights, and the weighted matching word vectors are spliced to obtain the vector to be matched. For each standard text, after each standard word in its standard word set is converted into a corresponding standard word vector, the standard word vectors are weighted by their corresponding weights, and the weighted standard word vectors are spliced to obtain the standard vector. In the embodiment of the invention, the splicing is performed by inputting the word vectors to be spliced into a splicing layer, which outputs the corresponding splicing result. Weighted splicing allows the vector representation to reflect the importance of each word in the text to be matched or the standard text, which improves the degree of correlation between the vector representation and the corresponding text information.
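Weighted splicing can be sketched as scaling each word vector by its weight and concatenating the results. The two-dimensional toy embeddings and the weight values below are invented for illustration; in the described method the vectors would come from a trained word vector conversion layer such as word2vec.

```python
from typing import Dict, List, Sequence


def weighted_concat(
    words: Sequence[str],
    embeddings: Dict[str, Sequence[float]],
    weights: Dict[str, float],
    default_weight: float = 1.0,
) -> List[float]:
    """Scale each word vector by its weight, then concatenate in word order."""
    vector: List[float] = []
    for word in words:
        weight = weights.get(word, default_weight)
        vector.extend(weight * component for component in embeddings[word])
    return vector


# Toy embeddings and weights (assumptions, not trained values).
toy_embeddings = {
    "brand A": [1.0, 0.0],
    "1 series": [0.0, 1.0],
    "sports type": [0.5, 0.5],
}
toy_weights = {"brand A": 1.0, "1 series": 1.0, "sports type": 0.8}

text_vector = weighted_concat(
    ["brand A", "1 series", "sports type"], toy_embeddings, toy_weights
)
```

Note that a concatenated vector grows with the number of words, so comparing two such vectors requires them to have equal length; how lengths are aligned is not stated in the description, and this sketch does not address it.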
Step S500, determining the first similarity of the vector to be matched and each standard vector to determine at least one candidate standard vector.
Specifically, after the vector to be matched representing the text to be matched and the standard vector corresponding to each standard text are determined, the similarity between the vector to be matched and each standard vector is determined as the first similarity, and the standard vectors are screened according to their first similarities to obtain at least one candidate standard vector. The first similarity may be the cosine similarity between the vector to be matched and the respective standard vector, which may be obtained by inputting the vector to be matched and each standard vector into a similarity calculation layer. After the first similarity between the vector to be matched and each standard vector is determined, whether each standard vector meets a preset screening rule is judged according to its first similarity, and the standard vectors meeting the screening rule are determined to be candidate standard vectors.
Further, the screening rule may be a preset similarity threshold: when the first similarity between the vector to be matched and a standard vector is greater than the similarity threshold, that standard vector is determined to be a candidate standard vector. For example, when the first similarities of the vector to be matched with standard vector 1, standard vector 2 and standard vector 3 are 0.7, 0.62 and 0.85 respectively, and the preset similarity threshold is 0.65, standard vector 1 and standard vector 3 are determined to be candidate standard vectors.
Optionally, the screening rule may instead preset the number N of candidate standard vectors: after the first similarities between the vector to be matched and the standard vectors are determined, the standard vectors are sorted by first similarity from large to small, and the first N standard vectors in the sorted result are determined to be the candidate standard vectors. For example, when the first similarities of the vector to be matched with standard vector 1, standard vector 2, standard vector 3 and standard vector 4 are 0.7, 0.62, 0.71 and 0.85 respectively, and the preset number of candidate standard vectors is 2, sorting the standard vectors by first similarity gives standard vector 4, standard vector 3, standard vector 1 and standard vector 2, and standard vector 4 and standard vector 3 are determined to be the candidate standard vectors.
Fig. 4 is a schematic diagram illustrating a process of determining a first similarity according to an embodiment of the present invention. As shown in fig. 4, in the embodiment of the present invention, a text 40 to be matched and a standard text set 41 are determined, and then a matching word set 42 corresponding to the text 40 to be matched and a standard word set 43 corresponding to each standard text are determined respectively through word segmentation. Each matching word in the matching word set 42 is input into a word vector conversion layer 44, and a vector 45 to be matched is determined from the output matching word vectors. Likewise, for each standard text, each standard word is input into the word vector conversion layer 44, and a corresponding standard vector 46 is determined from the output standard word vectors. The vector 45 to be matched and each standard vector 46 are input into a similarity calculation layer 47, which calculates the cosine similarity and outputs the corresponding first similarity 48.
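Step S500 and its two screening rules can be sketched as follows. The function names are assumptions; the similarity values reuse the two worked examples given for the threshold rule and the top-N rule.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def screen_by_threshold(similarities: Dict[str, float], threshold: float) -> List[str]:
    """Keep standard vectors whose first similarity exceeds the threshold."""
    return [name for name, value in similarities.items() if value > threshold]


def screen_top_n(similarities: Dict[str, float], n: int) -> List[str]:
    """Keep the n standard vectors with the largest first similarity."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:n]


by_threshold = screen_by_threshold(
    {"standard vector 1": 0.7, "standard vector 2": 0.62, "standard vector 3": 0.85},
    threshold=0.65,
)
top_two = screen_top_n(
    {
        "standard vector 1": 0.7,
        "standard vector 2": 0.62,
        "standard vector 3": 0.71,
        "standard vector 4": 0.85,
    },
    n=2,
)
```

The threshold rule returns a variable number of candidates, possibly none, whereas the top-N rule always returns up to N candidates regardless of how low their similarities are.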
Step S600, determining a corresponding second similarity according to the intersection of the standard word set corresponding to each candidate standard vector and the matching word set so as to determine a target standard vector.
Specifically, after each candidate standard vector is obtained through first screening, a standard word set corresponding to each candidate standard vector and a matching word set corresponding to a vector to be matched are determined, and a second similarity is determined according to an intersection of the standard word set and the matching word set. Further, after the second similarity is determined, the target standard vector is determined in the candidate standard vectors according to the corresponding second similarity.
Fig. 5 is a flowchart illustrating a process of determining a second similarity according to an embodiment of the present invention. As shown in fig. 5, the process of determining the second similarity degree includes the following steps:
Step S610, determining the intersection of the standard word set corresponding to each candidate standard vector and the matching word set as the same word set.
Specifically, for each candidate standard vector, the corresponding standard word set and the matching word set corresponding to the vector to be matched are determined, and the words present in both the standard word set and the matching word set are determined to be the same words, giving the corresponding same word set. The embodiment is illustrated with an example from the field of automobiles. When the standard word set corresponding to the candidate standard vector is {"brand A", "1 series", "sports type", "7-gear manual", "front-mounted front-wheel drive"} and the matching word set is {"brand A", "1 series", "fashion type", "7-gear manual-automatic integrated", "front-mounted four-wheel drive"}, the corresponding same word set is determined to be {"brand A", "1 series"}.
Step S620: for each candidate standard vector, calculating the sum of the weight values of the same words in the corresponding same word set to determine the corresponding second similarity.
Specifically, after the same word set corresponding to each candidate standard vector is determined, the weight value corresponding to each same word in the set is determined. The weight value represents the importance of the corresponding word and can be preset according to its part of speech or classification. The embodiment of the present invention is illustrated with an example from the automobile field. The weight values can be preset according to the classification of each word in the automobile field, for example "brand: 1", "vehicle series: 1", "vehicle type: 0.8", "transmission: 0.6" and "drive mode: 0.6". When the same word set corresponding to the candidate standard vector is {"A brand", "1 series", "front-engine four-wheel drive"}, "A brand" is a brand with a weight value of 1, "1 series" is a vehicle series with a weight value of 1, and "front-engine four-wheel drive" is a drive mode with a weight value of 0.6, so the sum of the weight values of the same words gives a second similarity of 1 + 1 + 0.6 = 2.6.
Fig. 6 is a schematic diagram of the process of determining the second similarity according to an embodiment of the present invention. As shown in Fig. 6, for each candidate standard vector, the corresponding standard word set 61 is determined, and the intersection of the standard word set 61 and the matching word set 60 is determined as the same word set 62. Then, the weight value corresponding to each same word in the same word set 62 is determined, and the sum of the weight values is calculated to obtain the corresponding second similarity 63.
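The weighted sum of step S620 can be sketched as follows. This is an illustrative sketch only: the category weights come from the automobile example, while the `WORD_CATEGORY` lookup, the function name `second_similarity`, and the rule that an unclassified word contributes zero are assumptions not stated in the source.

```python
from __future__ import annotations

# Preset weight per word category, taken from the automobile example.
CATEGORY_WEIGHTS = {
    "brand": 1.0,
    "vehicle series": 1.0,
    "vehicle type": 0.8,
    "transmission": 0.6,
    "drive mode": 0.6,
}

# Hypothetical lookup from a word to its category; the source does not say
# how this classification is stored.
WORD_CATEGORY = {
    "A brand": "brand",
    "1 series": "vehicle series",
    "front-engine four-wheel drive": "drive mode",
}


def second_similarity(same_words: set[str]) -> float:
    """Sum the preset weights of the words in a same word set (step S620)."""
    total = 0.0
    for word in same_words:
        category = WORD_CATEGORY.get(word)
        # Assumption: a word with no known category contributes nothing.
        total += CATEGORY_WEIGHTS.get(category, 0.0)
    return total


score = second_similarity({"A brand", "1 series", "front-engine four-wheel drive"})
print(round(score, 2))  # 2.6
```

The result is rounded only for display, since floating-point addition may not reproduce 2.6 exactly.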
Step S630: determining the candidate standard vector with the largest second similarity as the target standard vector.
Specifically, after the second similarity corresponding to each candidate standard vector is determined in step S620, a second screening is performed on the candidate standard vectors according to their second similarities, and the candidate standard vector with the largest second similarity is determined as the target standard vector.
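The second screening of step S630 is a maximum search over the candidates. A minimal sketch under stated assumptions: the identifiers `standard_1` to `standard_3` and the function name `select_target` are hypothetical, and the source does not specify how ties are broken (here the first candidate with the largest value wins).

```python
from __future__ import annotations


def select_target(second_similarities: dict[str, float]) -> str:
    """Return the id of the candidate with the largest second similarity (step S630)."""
    if not second_similarities:
        raise ValueError("no candidate standard vectors to choose from")
    # max() returns the first key with the largest value, so ties go to the
    # candidate that was inserted first (an assumption, not part of the source).
    return max(second_similarities, key=second_similarities.__getitem__)


# Hypothetical second similarities for three candidate standard vectors.
scores = {"standard_1": 2.6, "standard_2": 2.0, "standard_3": 1.6}
target = select_target(scores)
print(target)  # standard_1
```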
Step S700: determining the standard text corresponding to the target standard vector as the target standard text corresponding to the text to be matched.
Specifically, the standard text corresponding to the target standard vector is determined as the target standard text corresponding to the text to be matched, so that further data processing can be performed according to the specific application scenario. For example, when the standard text matching method is applied to a text error correction scenario, the text to be matched is replaced with the target standard text once that text has been obtained through matching.
The standard text matching method of the embodiment of the present invention performs a first screening based on similarity in the vector dimension and a second screening based on similarity in the text dimension to obtain the matched target standard text. The two screenings together improve the precision of standard text matching.
Fig. 7 is a schematic diagram of a standard text matching apparatus according to an embodiment of the present invention. As shown in Fig. 7, the standard text matching apparatus includes an information determination module 70, a first set determination module 71, a second set determination module 72, a vector determination module 73, a candidate vector determination module 74, a target vector determination module 75, and a target text determination module 76.
Specifically, the information determining module 70 is configured to determine a text to be matched and a standard text set, where the standard text set includes a plurality of standard texts.
The first set determining module 71 is configured to determine a matching word set including matching words in the text to be matched.
The second set determining module 72 is configured to determine a set of standard words including each standard word in each of the standard texts.
The vector determining module 73 is configured to determine a to-be-matched vector corresponding to the to-be-matched text according to the matching word set, and determine a standard vector corresponding to each standard text according to each standard word set.
The candidate vector determining module 74 is configured to determine a first similarity between the vector to be matched and each of the standard vectors to determine at least one candidate standard vector.
The target vector determining module 75 is configured to determine a corresponding second similarity according to an intersection of the set of standard words corresponding to each candidate standard vector and the set of matching words, so as to determine a target standard vector.
The target text determining module 76 is configured to determine that the standard text corresponding to the target standard vector is the target standard text corresponding to the text to be matched.
Further, the first set determining module specifically includes:
a first word segmentation submodule, configured to perform word segmentation on the text to be matched to obtain a plurality of words to be matched, and to determine the matching word set according to those words.
Further, the second set determining module specifically includes:
a second word segmentation submodule, configured to perform word segmentation on each standard text to obtain a plurality of standard words, and to determine the corresponding standard word set according to those standard words.
Further, the apparatus further comprises:
a synonym library determining module, configured to determine a synonym library according to each standard text, wherein the synonym library comprises the standard words corresponding to each standard text and words having the same meaning as those standard words.
Further, the apparatus further comprises:
a corresponding word determining module, configured to determine, in the synonym library, the standard word corresponding to each matching word in the matching word set;
and a word replacing module, configured to replace the matching words in the matching word set with the corresponding standard words.
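The synonym lookup and replacement performed by these two modules can be sketched as a dictionary lookup. This is an assumption-laden sketch: the source does not describe the storage format of the synonym library, so the mapping from a synonym to its standard word, the entries in `SYNONYM_LIBRARY`, and the function name `replace_with_standard_words` are all hypothetical.

```python
from __future__ import annotations

# Hypothetical synonym library: each key is a synonym, each value is the
# standard word it should be replaced with.
SYNONYM_LIBRARY = {
    "4WD": "four-wheel drive",
    "stick shift": "manual",
}


def replace_with_standard_words(
    matching_words: list[str], synonym_library: dict[str, str]
) -> list[str]:
    """Replace each matching word with its standard word when one is known."""
    # Words without an entry in the library are kept unchanged.
    return [synonym_library.get(word, word) for word in matching_words]


normalized = replace_with_standard_words(
    ["A brand", "4WD", "stick shift"], SYNONYM_LIBRARY
)
print(normalized)  # ['A brand', 'four-wheel drive', 'manual']
```

Normalizing the matching words in this way before the later steps lets the intersection with a standard word set succeed even when the text to be matched uses a different wording.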
Further, the vector determination module comprises:
a first vector determination submodule, configured to input each matching word in the matching word set into a trained word vector conversion layer, output the corresponding matching word vectors, and concatenate them to obtain the vector to be matched;
and a second vector determination submodule, configured to input each standard word in each standard word set into the trained word vector conversion layer, output the corresponding standard word vectors, and concatenate them to obtain the corresponding standard vector.
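The conversion and splicing (concatenation) step can be sketched as below. The trained word vector conversion layer is replaced here by a toy lookup table, which is purely an assumption for illustration; `TOY_EMBEDDINGS`, its two-dimensional values, and the function name `splice_word_vectors` are hypothetical.

```python
from __future__ import annotations

# Toy stand-in for a trained word vector conversion layer: a fixed lookup
# table from word to vector. Real embeddings would come from a trained model.
TOY_EMBEDDINGS = {
    "A brand": [1.0, 0.0],
    "1 series": [0.0, 1.0],
    "sports type": [0.5, 0.5],
}


def splice_word_vectors(
    words: list[str], embeddings: dict[str, list[float]]
) -> list[float]:
    """Convert each word to its vector and concatenate the vectors in order."""
    spliced: list[float] = []
    for word in words:
        # Raises KeyError for a word the table does not cover.
        spliced.extend(embeddings[word])
    return spliced


vector = splice_word_vectors(["A brand", "1 series"], TOY_EMBEDDINGS)
print(vector)  # [1.0, 0.0, 0.0, 1.0]
```

Note that concatenation makes the length of the result depend on the number of words; the source does not state how texts with different word counts are brought to a common dimension.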
Further, the candidate vector determination module comprises:
a first similarity calculation submodule, configured to calculate the cosine similarity between the vector to be matched and each standard vector to determine the first similarity;
and a candidate vector determining submodule, configured to determine the corresponding standard vector as a candidate standard vector in response to the first similarity being greater than a similarity threshold.
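The first screening can be sketched as a cosine similarity followed by a threshold filter. This is a sketch under stated assumptions: the vectors, the identifiers `s1` to `s3`, the threshold of 0.5, and the choices to reject vectors of unequal length and to score zero vectors as 0.0 are all illustrative and not taken from the source.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def candidate_standard_vectors(
    to_match: list[float],
    standard_vectors: dict[str, list[float]],
    threshold: float,
) -> dict[str, float]:
    """Keep the standard vectors whose first similarity exceeds the threshold."""
    candidates: dict[str, float] = {}
    for key, standard in standard_vectors.items():
        similarity = cosine_similarity(to_match, standard)
        if similarity > threshold:
            candidates[key] = similarity
    return candidates


# Hypothetical vectors and threshold, for illustration only.
standards = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [1.0, 1.0]}
candidates = candidate_standard_vectors([1.0, 0.0], standards, threshold=0.5)
print(sorted(candidates))  # ['s1', 's3']
```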
Further, the target vector determination module comprises:
a same word determining submodule, configured to determine the intersection of the standard word set corresponding to each candidate standard vector and the matching word set as a same word set;
a second similarity calculation submodule, configured to calculate, for each candidate standard vector, the sum of the weight values of the same words in the corresponding same word set to determine the corresponding second similarity;
and a target vector determining submodule, configured to determine the candidate standard vector with the largest second similarity as the target standard vector.
The standard text matching apparatus of the embodiment of the present invention performs a first screening based on similarity in the vector dimension and a second screening based on similarity in the text dimension to obtain the matched target standard text. The two screenings together improve the precision of standard text matching.
Fig. 8 is a schematic diagram of an electronic device according to an embodiment of the invention. As shown in Fig. 8, the electronic device is a general address query device with a general computer hardware structure that includes at least a processor 80 and a memory 81. The processor 80 and the memory 81 are connected by a bus 82. The memory 81 is adapted to store instructions or programs executable by the processor 80. The processor 80 may be a stand-alone microprocessor or a collection of one or more microprocessors. The processor 80 thus processes data and controls other devices by executing the instructions stored in the memory 81, so as to perform the method flows of the embodiments of the present invention described above. The bus 82 connects the above components together, and also connects them to a display controller 83, a display device, and input/output (I/O) devices 84. The input/output (I/O) devices 84 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, or other devices known in the art. Typically, the input/output devices 84 are coupled to the system through an input/output (I/O) controller 85.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the embodiments described above may be carried out by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (18)

1. A method for standard text matching, the method comprising:
determining a text to be matched and a standard text set, wherein the standard text set comprises a plurality of standard texts;
determining a matching word set comprising all matching words in the text to be matched;
determining a standard word set comprising each standard word in each standard text;
determining vectors to be matched corresponding to the texts to be matched according to the matching word sets, and determining standard vectors corresponding to the standard texts according to the standard word sets;
determining a first similarity of the vector to be matched and each standard vector so as to determine at least one candidate standard vector;
determining corresponding second similarity according to the intersection of the standard word set corresponding to each candidate standard vector and the matching word set so as to determine a target standard vector;
and determining a standard text corresponding to the target standard vector as a target standard text corresponding to the text to be matched.
2. The method according to claim 1, wherein the determining of the matching word set including the matching words in the text to be matched specifically includes:
performing word segmentation on the text to be matched to obtain a plurality of words to be matched, and determining a matching word set according to the words to be matched.
3. The method according to claim 1, wherein the determining the set of standard words including each standard word in each standard text is specifically:
and respectively carrying out word segmentation processing on each standard text to obtain a plurality of standard words so as to determine a corresponding standard word set according to the corresponding standard words.
4. The method of claim 1, further comprising:
and determining a synonym library according to each standard text, wherein the synonym library comprises the standard words corresponding to each standard text and words having the same meaning as those standard words.
5. The method of claim 4, further comprising:
determining a standard word corresponding to each matching word in the matching word set in the synonym library;
and replacing the matched words in the matched word set with corresponding standard words.
6. The method according to claim 1, wherein the determining the to-be-matched vector corresponding to the text to be matched according to the matching word sets and the determining the standard vector corresponding to each standard text according to each standard word set comprises:
inputting each matching word in the matching word set into a trained word vector conversion layer respectively, and outputting corresponding matching word vectors to obtain vectors to be matched in a splicing manner;
and respectively inputting each standard word in each standard word set into a trained word vector conversion layer, and outputting a corresponding standard word vector to obtain a corresponding standard vector by splicing.
7. The method of claim 1, wherein determining a first similarity between the vector to be matched and each of the standard vectors to determine at least one candidate standard vector comprises:
calculating cosine similarity of the vector to be matched and each standard vector to determine first similarity;
in response to the first similarity being greater than a similarity threshold, determining the corresponding standard vector as a candidate standard vector.
8. The method of claim 1, wherein determining a corresponding second similarity according to an intersection of the standard word set corresponding to each candidate standard vector and the matching word set to determine a target standard vector comprises:
determining the intersection of the standard word set corresponding to each candidate standard vector and the matching word set as the same word set;
for each candidate standard vector, calculating the weighted value sum of each identical word in the corresponding identical word set to determine a corresponding second similarity;
and determining the corresponding candidate standard vector with the maximum second similarity as a target standard vector.
9. A standard text matching apparatus, characterized in that the apparatus comprises:
the information determining module is used for determining a text to be matched and a standard text set, wherein the standard text set comprises a plurality of standard texts;
the first set determining module is used for determining a matching word set comprising all matching words in the text to be matched;
a second set determining module, configured to determine a standard word set including each standard word in each standard text;
the vector determining module is used for determining vectors to be matched corresponding to the texts to be matched according to the matching word sets and determining standard vectors corresponding to the standard texts according to the standard word sets;
the candidate vector determining module is used for determining the first similarity of the vector to be matched and each standard vector so as to determine at least one candidate standard vector;
the target vector determining module is used for determining corresponding second similarity according to the intersection of the standard word set corresponding to each candidate standard vector and the matching word set so as to determine a target standard vector;
and the target text determining module is used for determining that the standard text corresponding to the target standard vector is the target standard text corresponding to the text to be matched.
10. The apparatus according to claim 9, wherein the first set determining module is specifically:
and the first word segmentation submodule is used for carrying out word segmentation on the text to be matched to obtain a plurality of words to be matched so as to determine a matching word set according to the words to be matched.
11. The apparatus according to claim 9, wherein the second set determining module is specifically:
and the second word segmentation submodule is used for performing word segmentation processing on each standard text to obtain a plurality of standard words and determining a corresponding standard word set according to the plurality of corresponding standard words.
12. The apparatus of claim 9, further comprising:
and the synonym library determining module is used for determining a synonym library according to each standard text, wherein the synonym library comprises the standard words corresponding to each standard text and words having the same meaning as those standard words.
13. The apparatus of claim 12, further comprising:
a corresponding word determining module, configured to determine, in the synonym library, a standard word corresponding to each matching word in the matching word set;
and the word replacing module is used for replacing the matched words in the matched word set with the corresponding standard words.
14. The apparatus of claim 9, wherein the vector determination module comprises:
the first vector determination submodule is used for respectively inputting each matching word in the matching word set into a trained word vector conversion layer and outputting corresponding matching word vectors so as to obtain vectors to be matched through splicing;
and the second vector determination submodule is used for respectively inputting each standard word in each standard word set into a trained word vector conversion layer and outputting a corresponding standard word vector so as to splice to obtain a corresponding standard vector.
15. The apparatus of claim 9, wherein the candidate vector determination module comprises:
the first similarity calculation submodule is used for calculating cosine similarity between the vector to be matched and each standard vector so as to determine first similarity;
and the candidate vector determining submodule is used for determining the corresponding standard vector as a candidate standard vector in response to the first similarity being larger than a similarity threshold.
16. The apparatus of claim 9, wherein the target vector determination module comprises:
a standard word determining submodule, configured to determine that an intersection between a standard word set and the matching word set, where the standard word set corresponds to each candidate standard vector, is a same word set;
the second similarity calculation submodule is used for calculating the weighted sum of all the same words in the corresponding same word set for all the candidate standard vectors so as to determine the corresponding second similarity;
and the target vector determining submodule is used for determining the corresponding candidate standard vector with the maximum second similarity as the target standard vector.
17. A computer readable storage medium storing computer program instructions, which when executed by a processor implement the method of any one of claims 1-8.
18. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011257154.0A CN112541051A (en) 2020-11-11 2020-11-11 Standard text matching method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011257154.0A CN112541051A (en) 2020-11-11 2020-11-11 Standard text matching method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112541051A true CN112541051A (en) 2021-03-23

Family

ID=75015038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011257154.0A Pending CN112541051A (en) 2020-11-11 2020-11-11 Standard text matching method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112541051A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination