CN110909532B - User name matching method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN110909532B
Authority
CN
China
Prior art keywords
word
matched
word segmentation
segmentation result
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911053197.4A
Other languages
Chinese (zh)
Other versions
CN110909532A (en)
Inventor
杨可歆
葛新蕾
袁野
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unionpay Smart Information Services Shanghai Co ltd
Original Assignee
Unionpay Smart Information Services Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unionpay Smart Information Services Shanghai Co ltd filed Critical Unionpay Smart Information Services Shanghai Co ltd
Priority to CN201911053197.4A priority Critical patent/CN110909532B/en
Publication of CN110909532A publication Critical patent/CN110909532A/en
Application granted granted Critical
Publication of CN110909532B publication Critical patent/CN110909532B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A user name matching method, a user name matching device, computer equipment and a storage medium are provided, wherein the user name matching method comprises the following steps: acquiring a user entry to be matched; performing word segmentation on the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result; marking a plurality of word blocks in the word segmentation result, wherein the plurality of word blocks comprise at least two of geographic word blocks, brand word blocks and category word blocks; acquiring a reference user name, and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one; and when the similarity is higher than a preset value, determining that the entry of the user to be matched is matched with the reference user name. By the method, the user name can be accurately matched, and the influence of interference words in the user name is avoided.

Description

User name matching method and device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a user name matching method, a user name matching device, computer equipment and a storage medium.
Background
Banks and large merchants often manage user transaction data by user name and store the data in an internal database. However, the user names stored in the database are usually unstructured text that does not follow a set rule, and they may contain geographic location information and colloquial interference words.
In existing approaches, user names can only be matched by exact matching of all fields. The geographic location information and interference words contained in a user name interfere with exact matching, so the accuracy of the matching result is poor.
Disclosure of Invention
The technical problem solved by the invention is how to accurately match the user name and avoid the influence of interference words in the user name.
In order to solve the above technical problem, an embodiment of the present invention provides a user name matching method, where the method includes: acquiring a user entry to be matched; performing word segmentation on the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result; marking a plurality of word blocks in the word segmentation result, wherein the plurality of word blocks comprise at least two of geographic word blocks, brand word blocks and category word blocks; acquiring a reference user name, and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one; and when the similarity is higher than a preset value, determining that the entry of the user to be matched is matched with the reference user name.
Optionally, the marking a plurality of word blocks in the word segmentation result, where the plurality of word blocks include at least two of a geographic word block, a brand word block, and a category word block, includes: and marking geographic word blocks, brand word blocks and category word blocks in the word segmentation result.
Optionally, the marking geographic word blocks, brand word blocks, and category word blocks in the word segmentation result includes: marking a geographic word block in the word segmentation result; acquiring a reference document corresponding to the to-be-matched user entry; calculating the occurrence frequency of the word segmentation result in the reference document, and calculating the inverted text frequency of the word segmentation result according to the occurrence frequency; taking the word segmentation result with the minimum inverted text frequency as the brand word block, and marking the brand word block; and taking words except the geographic word block and the brand word block in the word segmentation result as the category word block, and marking the category word block.
Optionally, marking a geographic word block in the word segmentation result includes: acquiring a geographic configuration table, wherein the geographic configuration table is established according to geographical relations among regions; and marking the geographic word block from the word segmentation result according to the geographic configuration table.
Optionally, the segmenting the to-be-matched user entry according to a preset segmentation rule includes: performing word segmentation on the user entry to be matched to obtain an initial word segmentation result; optimizing the initial word segmentation result, wherein the optimizing comprises: performing secondary word segmentation on the initial word segmentation result to obtain a secondary word segmentation result; acquiring a training corpus, and calculating the co-occurrence rate of characters in the secondary word segmentation result according to the training corpus; judging boundary words in the secondary word segmentation result; and obtaining the word segmentation result according to the co-occurrence rate and the boundary words.
Optionally, after determining whether the to-be-matched user entry matches the reference user name, the method further includes: adding a to-be-matched user entry that does not match any reference user name to the reference user names.
The embodiment of the invention also provides a user name matching device, which comprises: the vocabulary entry acquisition module is used for acquiring the vocabulary entries of the users to be matched; the word segmentation module is used for segmenting words of the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result; the word block marking module is used for marking a plurality of word blocks in the word segmentation result, wherein the word blocks comprise at least two of geographic word blocks, brand word blocks and category word blocks; the similarity calculation module is used for acquiring a reference user name and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one; and the matching module is used for determining that the to-be-matched user entry is matched with the reference user name when the similarity is higher than a preset value.
Optionally, the word block tagging module is further configured to tag a geographic word block, a brand word block, and a category word block in the word segmentation result.
The embodiment of the present invention further provides a computer device, which includes a memory and a processor, where the memory stores computer instructions capable of running on the processor, and the processor executes the steps of the method when executing the computer instructions.
The embodiment of the invention also provides a storage medium, wherein computer instructions are stored on the storage medium, and the computer instructions execute the steps of the method when running.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
the embodiment of the invention provides a user name matching method, which comprises the following steps: acquiring a user entry to be matched; performing word segmentation on the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result; marking a plurality of word blocks in the word segmentation result, wherein the plurality of word blocks comprise at least two of geographic word blocks, brand word blocks and category word blocks; acquiring a reference user name, and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one; and when the similarity is higher than a preset value, determining that the entry of the user to be matched is matched with the reference user name. Compared with the prior art, the user name matching method in the application carries out word segmentation on the user name from three dimensions of geography, brand and category, carries out corresponding word block matching according to different rules respectively, finally combines corresponding dimension matching results of at least two word blocks to serve as a final matching result of the user name, can remove interfering words in the word segmentation process, thereby avoiding the influence of the interfering words, and ensures the comprehensiveness and accuracy of the matching result from different dimensions.
Furthermore, a reference document can be introduced as a basis for identifying the brand word block, and a word with the minimum inverted text frequency is used as the brand word block, so that the brand word block can be accurately identified from the word segmentation result.
Further, secondary word segmentation is carried out on the primary word segmentation result of the jieba word segmentation tool, the primary word segmentation result is corrected, and the accuracy of the word segmentation result is improved.
Further, for the user entry to be matched, which does not match with all the stored reference user names, the entry can be used as a new user name for subsequent user name matching operation.
Drawings
FIG. 1 is a flowchart of a user name matching method in an embodiment of the invention;
FIG. 2 is a flowchart of step S106 in FIG. 1 according to an embodiment of the present invention;
FIG. 3 is a flow chart of an optimization manner for initial segmentation results in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a user name matching apparatus in an embodiment of the present invention.
Detailed Description
As described in the Background, user name matching in existing technical solutions usually relies on exact matching, and the accuracy of the matching result is poor.
In order to solve the above technical problem, an embodiment of the present invention provides a user name matching method, where the method includes: acquiring a user entry to be matched; performing word segmentation on the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result; marking a plurality of word blocks in the word segmentation result, wherein the plurality of word blocks comprise at least two of geographic word blocks, brand word blocks and category word blocks; acquiring a reference user name, and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one; and when the similarity is higher than a preset value, determining that the entry of the user to be matched is matched with the reference user name.
Compared with the prior art, the user name matching method in the application carries out word segmentation on the user name from three dimensions of geography, brand and category, carries out corresponding word block matching according to different rules respectively, finally combines corresponding dimension matching results of at least two word blocks to serve as a final matching result of the user name, can remove interfering words in the word segmentation process, thereby avoiding the influence of the interfering words, and ensures the comprehensiveness and accuracy of the matching result from different dimensions.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 is a flowchart of a user name matching method according to an embodiment of the present invention. The user name matching method of the embodiment may include the following steps:
Step S102, acquiring the user entry to be matched.
The to-be-matched user entry is an entry that contains a user name and is input from the terminal of a user or an inquirer. For example, "an afterglow household appliance store" and "the unie-logue of the dark-wood mahogany gymnastics club" are each a user entry. When the terminal needs to match a user entry against the user names stored in the database, that user entry is used as the to-be-matched user entry.
Step S104, performing word segmentation on the to-be-matched user entry according to a preset word segmentation rule to obtain a word segmentation result.
The preset word segmentation rule is a rule set in advance for segmenting the acquired to-be-matched user entry, that is, for dividing it into a plurality of word blocks. The preset word segmentation rule may include a rule for segmenting words according to geographic nouns and a rule for identifying and segmenting words according to preset brand and category keywords. It may also be a semantic recognition rule obtained through semantic analysis training based on Natural Language Processing (NLP), as provided by existing word segmentation software such as jieba (a Python-based Chinese word segmentation tool).
Specifically, after the to-be-matched user entry is obtained, the preset word segmentation rule is read from the location where it is stored, and the to-be-matched user entry is segmented according to that rule to obtain the corresponding word segmentation result. The word segmentation result is consistent with the preset word segmentation rule and may include: word blocks containing geographic nouns, word blocks containing brand keywords, word blocks containing category keywords, possible interfering word blocks, and so on.
Step S106, marking a plurality of word blocks in the word segmentation result, wherein the plurality of word blocks comprise at least two of geographic word blocks, brand word blocks and category word blocks.
After the word segmentation result is obtained, the word blocks used for user name identification are marked in it: word blocks containing geographic nouns are marked as geographic word blocks, word blocks containing brand keywords are marked as brand word blocks, and word blocks containing category keywords are marked as category word blocks. At least two of the geographic, brand and category word blocks must be marked before the next step of user name identification is performed.
Step S108, obtaining a reference user name, and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to word blocks one to one.
The reference user name is a stored user name. User names with historical transaction records can be read from the database; alternatively, a technician can build a common user name comparison table that stores common user names specifically for user name matching. In one example, merchant names that were active in the preceding 12 months (transaction amount > 0) may be extracted as of March 2019 and used as common user names.
Specifically, after the word blocks of the word segmentation result are marked in step S106, matching is performed separately for each word block, that is, from different dimensions. In each dimension, the similarity between the word block of that dimension and the corresponding word block in the reference user name is calculated according to the matching mode for that word block. The reference user name is likewise divided into word blocks of multiple dimensions, and these dimensions correspond one to one to the word blocks marked in the word segmentation result of the to-be-matched user entry. For example, if a geographic word block, a brand word block and a category word block are marked in the word segmentation result, the reference user name is also divided into word blocks of the geographic, brand and category dimensions, and the similarity in each of the three dimensions is calculated to obtain the overall similarity between the to-be-matched user entry and the reference user name.
Optionally, once the word block types to be marked in the word segmentation result of each to-be-matched user entry are set, the database or the user name comparison table may store the reference user names by the dimensions corresponding to those word block types. The word block of each dimension can then be read directly from the database or the comparison table, and its similarity to the corresponding word block in the word segmentation result of the to-be-matched user entry is calculated. For example, in the geographic dimension, if the geographic word block in the to-be-matched user entry and the geographic-dimension word block in the reference user name correspond to the same geographic location, the similarity is 1 (the maximum value, meaning the two are identical); if the two correspond to the same province and city but to different counties or towns, a lower similarity score may be set, for example 0.8 or 0.7. Similarly, corresponding similarity algorithms can be set for the brand word block and the category word block, so as to calculate the similarity between the to-be-matched user entry and the reference user name in the brand and category dimensions.
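The geographic scoring rule in the example above can be sketched as follows. This is only an illustration: the `Region` type, the choice of 0.8 for "same province and city, different county or town", and the score of 0.0 for every other case are assumptions, since the text gives 0.8 or 0.7 merely as examples and does not specify the remaining cases.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """A resolved geographic word block (hypothetical structure)."""
    province: str
    city: str
    county: Optional[str] = None


def geo_similarity(a: Region, b: Region) -> float:
    """Score two geographic word blocks in the range [0, 1]."""
    if a == b:
        return 1.0  # same geographic location: maximum similarity
    if a.province == b.province and a.city == b.city:
        return 0.8  # same province and city, different county or town
    return 0.0      # assumption: all other cases are treated as dissimilar
```

A finer-grained table of scores (for example, a separate value for "same province only") could replace the final fallback if the geographic configuration table supports it.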
Step S110, when the similarity is higher than a preset value, determining that the to-be-matched user entry matches the reference user name.
Specifically, if the similarity between the to-be-matched user entry and the reference user name, calculated across the dimensions, is higher than the preset value, the to-be-matched user entry and the stored reference user name can be considered to correspond to the same user, that is, the two are determined to match. The preset value is the threshold used to judge whether the similarity indicates that the to-be-matched user entry and the reference user name correspond to the same user; its specific value can be determined empirically.
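A minimal sketch of steps S108 and S110 taken together is given below. The text does not say how the per-dimension similarities are combined, so the weighted average, the weights, the threshold of 0.8, and the use of `difflib.SequenceMatcher` as a stand-in per-block similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical weights per dimension and a hypothetical preset value.
WEIGHTS = {"geo": 0.2, "brand": 0.5, "category": 0.3}
THRESHOLD = 0.8


def block_similarity(a: str, b: str) -> float:
    """Stand-in similarity for one word block (assumption, not the patent's rule)."""
    return SequenceMatcher(None, a, b).ratio()


def overall_similarity(entry: dict, reference: dict) -> float:
    """Weighted average over the dimensions marked in both names."""
    dims = [d for d in WEIGHTS if d in entry and d in reference]
    if len(dims) < 2:
        raise ValueError("at least two marked word blocks are required")
    total_weight = sum(WEIGHTS[d] for d in dims)
    score = sum(WEIGHTS[d] * block_similarity(entry[d], reference[d]) for d in dims)
    return score / total_weight


def is_match(entry: dict, reference: dict, threshold: float = THRESHOLD) -> bool:
    """Step S110: match when the similarity is higher than the preset value."""
    return overall_similarity(entry, reference) > threshold
```

Each name is passed as a dictionary keyed by dimension, for example `{"geo": "上海", "brand": "星光", "category": "家电"}`; the example values are made up.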
According to the user name matching method in the embodiment, the user names are segmented from three dimensions of geography, brand and category, corresponding word block matching is carried out according to different rules, finally, corresponding dimension matching results of at least two word blocks are combined to serve as a final matching result of the user names, interfering words can be removed in the segmentation process, accordingly, the influence of the interfering words is avoided, matching is carried out from different dimensions, and comprehensiveness and accuracy of matching results are guaranteed.
In one embodiment, step S106 in fig. 1 marks a plurality of word blocks in the word segmentation result, the plurality of word blocks including at least two of a geographic word block, a brand word block, and a category word block, and may include: and marking geographic word blocks, brand word blocks and category word blocks in the word segmentation result.
When word blocks are marked in the word segmentation result, geographic word blocks, brand word blocks and category word blocks can all be marked. In step S108, the similarity between the to-be-matched user entry and the reference user name is then calculated from the three dimensions of geography, brand and category. Because every part of the user name is included in the similarity calculation, the similarity is calculated more accurately, and the user name matching result is more accurate.
In one embodiment, when step S106 in fig. 1 marks a plurality of word blocks in the segmentation result, where the plurality of word blocks include at least two of a geographic word block, a brand word block, and a category word block, that means that the geographic word block, the brand word block, and the category word block are marked in the segmentation result, step S106 may specifically include:
Step S202, marking geographic word blocks in the word segmentation result.
After the word segmentation result is obtained by segmenting the word of the entry of the user to be matched, the geographic word block is marked from the word segmentation result according to the geographic noun. A geographical term library may be created to identify words in the segmentation results that represent geographical locations and label these words as geographical word blocks.
Step S204, obtaining a reference document corresponding to the vocabulary entry of the user to be matched.
The reference document of the to-be-matched user entry is a related document containing the to-be-matched user entry, such as a news report or introduction material of a user. When the user name matching is carried out on the to-be-matched user entry, a reference document containing the to-be-matched user entry can be crawled from the internet to serve as a word segmentation basis.
Step S206, calculating the occurrence frequency of the word segmentation result in the reference document, and calculating the inverted text frequency of the word segmentation result according to the occurrence frequency.
The importance of a word is calculated from the frequency with which the word appears in the reference text and the frequency with which it appears across the documents of the whole corpus, so that the more important words in the word segmentation result are retained. The larger the value, the more important the word is to the text. Further, the inverted text frequency of each word of the word segmentation result in the reference document can be calculated. The formula for the importance of a word is as follows:
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}$$
where $\mathrm{tfidf}_{i,j}$ is the product of the word frequency $\mathrm{tf}_{i,j}$ and the inverted text frequency $\mathrm{idf}_{i}$. The larger the TF-IDF value, the more important the word is to the reference text. $\mathrm{tf}_{i,j}$ (TF stands for Term Frequency) is the frequency with which a given keyword appears in the whole article, calculated as follows:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where the numerator $n_{i,j}$ is the number of times the word $t_i$ appears in the reference text $d_j$, and the denominator $\sum_{k} n_{k,j}$ is the total number of occurrences of all words of the word segmentation result in the reference text $d_j$. The result is the word frequency of that word.
$$\mathrm{idf}_{i} = \log \frac{|D|}{1 + |D_{t_i}|}$$
IDF (Inverse Document Frequency) is the inverted text frequency. The text frequency is the number of articles in the whole corpus in which a given keyword appears. The inverted text frequency, as the name implies, is the inverse of the text frequency, and is mainly used to reduce the influence of words that are common across all documents but carry little weight for any one document.
$|D|$ is the total number of texts in the corpus, and $|D_{t_i}|$ is the number of texts that contain the feature word $t_i$. To avoid a zero denominator when the word does not appear in the corpus, $1 + |D_{t_i}|$ is used as the denominator.
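The calculation described above can be sketched in a few lines. The sketch assumes the usual logarithmic form of the inverted text frequency with the smoothed denominator 1 + |D_ti|, and it assumes that the reference text and corpus are already tokenized into lists of words; the function names are illustrative only.

```python
import math
from collections import Counter


def tf(word: str, doc_tokens: list) -> float:
    """Word frequency: occurrences of `word` divided by all tokens in the text."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[word] / len(doc_tokens)


def idf(word: str, corpus: list) -> float:
    """Inverted text frequency with 1 + |D_ti| as the denominator."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + containing))


def tfidf(word: str, doc_tokens: list, corpus: list) -> float:
    """Product of word frequency and inverted text frequency."""
    return tf(word, doc_tokens) * idf(word, corpus)
```

With this smoothing, a word that appears in all but one of the texts gets a value of zero, and a word that appears in every text gets a slightly negative value, which is a known side effect of adding 1 to the denominator only.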
Step S208, taking the word segmentation result with the minimum inverted text frequency as a brand word block, and marking the brand word block.
Specifically, the word in the word segmentation result with the minimum inverted text frequency is selected as the brand word block of the merchant name, and the brand word block is marked.
Step S210, taking words except the geographic word block and the brand word block in the word segmentation result as category word blocks, and marking the category word blocks.
After the geographic word blocks and the brand word blocks are labeled according to the above steps S202 to S208, the remaining word blocks may be regarded as category word blocks and labeled.
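Steps S202 to S210 can be sketched as a single marking function. The sketch follows the selection rule as stated in step S208 (the word with the minimum inverted text frequency becomes the brand word block); the set of geographic names, the precomputed `idf_scores` mapping, and the example values are hypothetical.

```python
def mark_blocks(tokens: list, geo_names: set, idf_scores: dict) -> dict:
    """Mark geographic, brand and category word blocks in a segmentation result."""
    # S202: words found in the geographic name set are geographic word blocks.
    geo = [t for t in tokens if t in geo_names]
    rest = [t for t in tokens if t not in geo_names]
    # S208: the remaining word with the minimum inverted text frequency is the brand.
    # Words without a score are ranked last (assumption).
    brand = min(rest, key=lambda t: idf_scores.get(t, float("inf"))) if rest else None
    # S210: everything that is neither geographic nor brand is a category word block.
    category = [t for t in rest if t != brand]
    return {"geo": geo, "brand": brand, "category": category}
```

For example, `mark_blocks(["上海", "星光", "家电", "店"], {"上海"}, {"星光": 0.1, "家电": 0.9, "店": 1.2})` marks "上海" as geographic, "星光" as the brand, and the remaining two words as category word blocks; the scores here are made-up numbers.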
In the embodiment, the reference document is introduced as a basis for identifying the brand word block, and the word with the minimum inverted text frequency is used as the brand word block, so that the brand word block is accurately identified from the word segmentation result.
In one embodiment, step S202 in FIG. 2 marks a geographic word block in the word segmentation result, which may include: acquiring a geographic configuration table, wherein the geographic configuration table is established according to geographical relations among regions; and marking the geographic word block from the word segmentation result according to the geographic configuration table.
The geographic configuration table records the names of regions, the distance relationships between regions, and the administrative relationships between regions (established according to the administrative hierarchy of province, city or prefecture, county, and town). Technicians can crawl the chain of relational data between regions from electronic maps such as Baidu Maps or from map information published online, and build the geographic configuration table as the basis for marking geographic word blocks in the word segmentation result. The geographic configuration table can be stored and shared as a spreadsheet (Excel) or in other common forms, such as an electronic map.
Specifically, a technician may build a geographic configuration table according to a relational data chain between regions crawled from an electronic map such as a Baidu map or published map information on a network, store the geographic configuration table in a fixed position as a marking basis of a geographic word block, read the geographic configuration table from the fixed position each time the geographic word block needs to be marked in a word segmentation result, judge whether the word segmentation result includes a noun of the region according to names of the regions in the geographic configuration table, and mark a corresponding word block in the word segmentation result as the geographic word block if the word block includes the noun of the region.
In addition, in step S108 in FIG. 1, the similarity between the geographic word block in the to-be-matched user entry and the geographic-dimension word block of the reference user name may also be calculated according to the relationships between regions in the geographic configuration table. Each merchant name is scanned in a loop starting from the i-th character (i is less than the length of the merchant name in characters); if the substring running from character i to character i + n (n < len(merchant name) - i) is found in the geographic configuration table, it is recorded. When geographic names are matched, the match is considered successful if the corresponding cities are consistent. Finally, the recorded result at the deepest administrative level (province, city, district) is taken as the final result, and the province, city and district fields are filled in accordingly.
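The substring scan described above might look like the following sketch. The layout of the configuration table (a dictionary from region name to a record with a `depth` field and province, city and district fields) and the sample entries are assumptions; the text does not specify how the table is stored.

```python
# Hypothetical excerpt of a geographic configuration table.
GEO_TABLE = {
    "上海市": {"depth": 1, "province": "上海市", "city": None, "district": None},
    "徐汇区": {"depth": 3, "province": "上海市", "city": "上海市", "district": "徐汇区"},
}


def find_regions(name: str, geo_table: dict) -> list:
    """Record every substring of `name` that appears in the configuration table."""
    hits = []
    for i in range(len(name)):                  # start position i
        for n in range(1, len(name) - i + 1):   # substring length n
            candidate = name[i:i + n]
            if candidate in geo_table:
                hits.append((candidate, geo_table[candidate]))
    return hits


def deepest_region(name: str, geo_table: dict):
    """Keep the recorded hit at the deepest administrative level, or None."""
    hits = find_regions(name, geo_table)
    return max(hits, key=lambda hit: hit[1]["depth"], default=None)
```

The record returned by `deepest_region` already carries the province and city of the matched district, which is one way the missing fields could be filled in.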
In this embodiment, a geographic configuration table is established to label geographic word blocks and calculate similarity between geographic word blocks in a geographic dimension, and a uniform and accurate standard is used to label and match geographic word blocks, so as to ensure accuracy of matching user names in the geographic dimension.
In an embodiment, the segmenting of the to-be-matched user entry according to the preset word segmentation rule in step S104 in FIG. 1 may include: performing word segmentation on the to-be-matched user entry to obtain an initial word segmentation result; and optimizing the initial word segmentation result. Referring to FIG. 3, the optimization of the initial word segmentation result may specifically include the following steps:
Step S302, performing secondary word segmentation on the initial word segmentation result to obtain a secondary word segmentation result.
Specifically, when the to-be-matched user entry is segmented in step S104 in FIG. 1, a common word segmentation tool such as jieba may be used. However, because of the particular nature of user names, many brand words are split apart and no longer form a complete brand word. The output of the word segmentation tool can therefore be used as the initial word segmentation result, and a secondary word segmentation operation is then performed on it, for example with an N-gram language model, to obtain the secondary word segmentation result.
For example, for a business name denoted as the character string ABCDEFGHIJK…, splitting it into 2-grams yields the character string combinations AB, BC, CD, DE, EF, FG, GH, …; these combinations are the secondary word segmentation result.
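The 2-gram split in this example can be written in a few lines. The function name `char_ngrams` is an illustrative choice, not a name used in the patent.

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("ABCDEFGH"))
```

A string shorter than n yields an empty list, so very short names produce no secondary segments.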
Step S304, a training corpus is obtained, and the co-occurrence rate of characters in the secondary word segmentation result is calculated according to the training corpus.
The training corpus is a corpus compiled from historical vocabulary analysis, and may be built from commonly used words and historical word segmentation corpora. For the secondary word segmentation result from step S302, the number of times each character string combination appears in the training corpus is determined, and the co-occurrence rate is calculated: the frequency with which adjacent characters co-occur in the training corpus is counted and the mutual information of the characters is calculated, giving the adjacent co-occurrence rate of each pair of adjacent characters in the secondary word segmentation result, namely f(AB), f(BC), f(CD), f(DE), f(EF), f(FG), f(GH), …
Step S306, the boundary words in the secondary word segmentation result are judged.
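One simple way to obtain the values f(AB), f(BC), … is to count adjacent character pairs in the training corpus. The sketch below assumes the co-occurrence rate is the relative frequency of a pair among all adjacent pairs in the corpus; the patent text does not fix the exact definition, and the function names are hypothetical.

```python
from collections import Counter


def bigram_counts(corpus: list) -> Counter:
    """Count every adjacent character pair in the corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        for i in range(len(sentence) - 1):
            counts[sentence[i:i + 2]] += 1
    return counts


def co_occurrence_rates(name: str, corpus: list) -> list:
    """Return (pair, rate) for each adjacent pair in name.

    Assumption of this sketch: rate = count(pair) / total pairs in the corpus.
    """
    counts = bigram_counts(corpus)
    total = sum(counts.values())
    rates = []
    for i in range(len(name) - 1):
        pair = name[i:i + 2]
        rates.append((pair, counts[pair] / total if total else 0.0))
    return rates
```

Pairs that never appear in the corpus get a rate of 0.0, which marks them as weakly associated.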
The determination method of the boundary word specifically includes:
Confidence calculation: for each value of N, measure the likelihood that each character position in the secondary word segmentation result is a word segmentation boundary:
Figure BDA0002255859240000101
wherein Con_n(i) represents the confidence of each character after word segmentation; n denotes the n in n-gram; i and k denote character positions; and fre_n(i) is the frequency, in the training corpus, of the character corresponding to position i in the word. A judgment function Assert(x, y) is additionally defined, which satisfies the following condition:
Figure BDA0002255859240000102
The confidence index is normalized according to the following formula to obtain Average(i), which serves as the criterion for the final word segmentation judgment, so that the resulting index is not limited by the defined distance N:
Figure BDA0002255859240000111
Using the result of the preceding step, the character at each position k is compared with its neighbors; if the following condition is satisfied, the character at position k is taken as a boundary character.
Average(k) > Average(k-1) and Average(k) > Average(k+1)
If the value Average(k) of a character k is greater than both Average(k-1) of the previous character and Average(k+1) of the next character, the character is judged to be a boundary character and is not combined with the next character into a word. The boundary characters in each word are computed in this way.
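The boundary test is a strict local-maximum check on the Average values. The sketch below assumes the Average scores have already been computed, one per character, and that the first and last positions are never boundaries because they lack a neighbor on one side; both function names are illustrative.

```python
def boundary_positions(average: list) -> list:
    """Positions k with Average(k) > Average(k-1) and Average(k) > Average(k+1)."""
    return [
        k
        for k in range(1, len(average) - 1)
        if average[k] > average[k - 1] and average[k] > average[k + 1]
    ]


def split_at_boundaries(text: str, average: list) -> list:
    """Cut text after each boundary character (assumes one score per character)."""
    blocks, start = [], 0
    for k in boundary_positions(average):
        blocks.append(text[start:k + 1])
        start = k + 1
    if start < len(text):
        blocks.append(text[start:])
    return blocks
```

For instance, with scores `[0.1, 0.5, 0.2, 0.3, 0.9, 0.4]` for the string `ABCDEF`, positions 1 and 4 are boundaries, so the string is cut after B and after E.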
Step S308, the word segmentation result is obtained according to the co-occurrence rate and the boundary words.
Specifically, the co-occurrence rates obtained in step S304 determine the association between characters, and the boundary words judged in step S306 determine the boundaries between different word blocks in the word segmentation result, which yields a more accurate word segmentation result.
In this embodiment, secondary word segmentation is performed on the initial word segmentation result produced by the jieba word segmentation tool, and the initial result is corrected, which improves the accuracy of the word segmentation result.
In an embodiment, after determining that the to-be-matched user entry matches the reference user name in step S110 in fig. 1, the method may further include: and adding the to-be-matched user entry which is not matched with the reference user name into the reference user name.
Specifically, if a to-be-matched user entry is not matched with all stored reference user names, the to-be-matched user entry is stored as a new user, that is, the to-be-matched user entry is added to the reference user names.
Optionally, the new user may be marked and a second, more accurate user name matching operation performed on it; for this secondary matching, the legitimacy of the new user name may be determined by manual matching or other methods. When the user name is determined to be legitimate and has no record in the database, it is added to the reference user names as a new user name.
In this embodiment, for the to-be-matched user entry that does not match all the stored reference user names, the to-be-matched user entry may be used as a new user name for subsequent user name matching operation.
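The fallback of registering an unmatched entry as a new reference user name can be sketched as below. The similarity function here is a stand-in based on Python's `difflib.SequenceMatcher`, not the multi-dimensional similarity of the method, and the threshold of 0.8 is an assumed preset value.

```python
from difflib import SequenceMatcher
from typing import Optional


def simple_similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the method itself uses word-block dimensions."""
    return SequenceMatcher(None, a, b).ratio()


def match_or_register(entry: str, references: list, threshold: float = 0.8) -> Optional[str]:
    """Return the first matching reference name, or register entry as new.

    When no stored reference name exceeds the threshold, the entry is
    appended to references and None is returned.
    """
    for ref in references:
        if simple_similarity(entry, ref) > threshold:
            return ref
    references.append(entry)
    return None
```

A production system would likely insert the optional secondary check (for example manual review) before the `append` call.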
Referring to fig. 4, fig. 4 is a schematic structural diagram of a user name matching apparatus according to an embodiment of the present invention, the apparatus including:
The entry obtaining module 100 is configured to obtain a to-be-matched user entry.
The word segmentation module 200 is configured to perform word segmentation on the to-be-matched user entry according to a preset word segmentation rule to obtain a word segmentation result.
The word block tagging module 300 is configured to tag a plurality of word blocks in the word segmentation result, where the plurality of word blocks includes at least two of a geographic word block, a brand word block, and a category word block.
The similarity calculation module 400 is configured to obtain a reference user name and calculate the similarity between the to-be-matched user entry and the reference user name according to multiple dimensions, where the dimensions correspond to the word blocks one to one.
The matching module 500 is configured to determine that the to-be-matched user entry matches the reference user name when the similarity is higher than a preset value.
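How modules 400 and 500 might cooperate can be illustrated with a per-dimension comparison followed by a threshold test. The weighted sum, the weights themselves, the `difflib`-based block similarity, and the threshold are all assumptions of this sketch; the text above only states that similarity is calculated per dimension and compared with a preset value.

```python
from difflib import SequenceMatcher

# Hypothetical weights for the three dimensions (they sum to 1.0).
WEIGHTS = {"geo": 0.3, "brand": 0.5, "category": 0.2}


def block_similarity(a: str, b: str) -> float:
    """Stand-in similarity between two word blocks, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def overall_similarity(entry_blocks: dict, ref_blocks: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-dimension similarities; missing blocks count as empty."""
    return sum(
        weight * block_similarity(entry_blocks.get(dim, ""), ref_blocks.get(dim, ""))
        for dim, weight in weights.items()
    )


def is_match(entry_blocks: dict, ref_blocks: dict, threshold: float = 0.8) -> bool:
    """Matching module: compare the combined similarity with the preset value."""
    return overall_similarity(entry_blocks, ref_blocks) > threshold
```

Giving the brand dimension the largest weight reflects the intuition that two merchants in the same city and category are still different users if their brands differ; other weightings are equally possible.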
In one embodiment, the word block tagging module 300 in fig. 4 is further configured to tag a geographic word block, a brand word block, and a category word block in the segmentation result.
In one embodiment, when the word block tagging module 300 in fig. 4 is used for tagging geographic word blocks, brand word blocks and category word blocks in the segmentation result, the word block tagging module 300 may include:
The geographic word block marking unit is used for marking the geographic word block in the word segmentation result.
The reference document acquisition unit is used for acquiring a reference document corresponding to the to-be-matched user entry.
The inverted text frequency calculation module is used for calculating the occurrence frequency of the word segmentation result in the reference document and calculating the inverted text frequency (that is, the inverse document frequency, IDF) of the word segmentation result according to the occurrence frequency.
The brand word block marking unit is used for taking the word segmentation result with the minimum inverted text frequency as the brand word block and marking it.
The category word block marking unit is used for marking the words in the word segmentation result other than the geographic word block and the brand word block as category word blocks.
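The brand and category marking performed by these units can be sketched as follows. The smoothed formula log(N / (1 + df)) is one common form of inverse document frequency and is an assumption here, as are the function names; selecting the minimum value follows the criterion stated in the text.

```python
import math


def inverse_document_frequency(term: str, documents: list) -> float:
    """Smoothed IDF: log(N / (1 + number of documents containing term)).

    Assumes documents is non-empty.
    """
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (1 + containing))


def mark_brand_and_category(blocks: list, documents: list) -> tuple:
    """Split non-geographic blocks into (brand block, category blocks).

    The block with the minimum inverted text frequency is taken as the brand
    word block; all remaining blocks are treated as category word blocks.
    """
    brand = min(blocks, key=lambda b: inverse_document_frequency(b, documents))
    categories = [b for b in blocks if b != brand]
    return brand, categories
```

The geographic word block is assumed to have been removed from `blocks` beforehand, so whatever is left after the brand is picked falls into the category dimension.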
In one embodiment, the geographic word block tagging unit may include:
The geographical configuration table acquisition subunit is used for acquiring the geographical configuration table, which is established according to the geographical relationships among regions.
The geographic word block marking subunit is used for marking the geographic word block in the word segmentation result according to the geographical configuration table.
In one embodiment, the word segmentation module 200 in fig. 4 includes:
The preliminary word segmentation unit is used for performing word segmentation on the to-be-matched user entry to obtain an initial word segmentation result.
The secondary word segmentation unit is used for performing secondary word segmentation on the initial word segmentation result to obtain a secondary word segmentation result.
The co-occurrence rate calculating unit is used for acquiring the training corpus and obtaining the co-occurrence rate of the characters in the secondary word segmentation result according to the training corpus.
The boundary word judging unit is used for judging the boundary words in the secondary word segmentation result.
The word segmentation result acquisition unit is used for obtaining the word segmentation result according to the co-occurrence rate and the boundary words.
In one embodiment, the user name matching apparatus may further include:
The new user name establishing module is used for adding a to-be-matched user entry that does not match any reference user name to the reference user names.
Those skilled in the art understand that the user name matching apparatus in this embodiment may be used to implement the technical solution of the user name matching method in the embodiments shown in fig. 1 to fig. 3.
For more details of the operation principle and the operation mode of the user name matching apparatus, reference may be made to the related descriptions in fig. 1 to fig. 3, which are not described herein again.
Further, the embodiment of the present invention also discloses a storage medium on which computer instructions are stored; when the computer instructions are executed, the technical solutions of the methods in the embodiments shown in fig. 1 to fig. 3 are carried out. Preferably, the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory. The storage medium may include ROM, RAM, magnetic or optical disks, and the like.
Further, the embodiment of the present invention also discloses a computer device, which includes a memory and a processor, where the memory stores computer instructions capable of running on the processor, and the processor executes the technical solutions of the methods in the embodiments shown in fig. 1 to fig. 3 when executing the computer instructions.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. A user name matching method, the method comprising:
acquiring a user entry to be matched;
performing word segmentation on the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result;
marking geographic word blocks, brand word blocks and category word blocks in the word segmentation result;
acquiring a reference user name, and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one;
when the similarity is higher than a preset value, determining that the entry of the user to be matched is matched with the reference user name;
wherein, said marking geographic word blocks, brand word blocks and category word blocks in said word segmentation result comprises:
marking a geographic word block in the word segmentation result;
acquiring a reference document corresponding to the to-be-matched user entry;
calculating the occurrence frequency of the word segmentation result in the reference document, and calculating the inverted text frequency of the word segmentation result according to the occurrence frequency;
taking the word segmentation result with the minimum inverted text frequency as the brand word block, and marking the brand word block;
and taking words except the geographic word block and the brand word block in the word segmentation result as the category word block, and marking the category word block.
2. The method of claim 1, wherein tagging geographic word blocks in the segmentation results comprises:
acquiring a geographical configuration table, wherein the geographical configuration table is established according to geographical relations among regions;
and marking the geographic word block from the word segmentation result according to the geographic configuration table.
3. The method according to claim 1, wherein the segmenting the to-be-matched user entry according to a preset segmentation rule comprises:
performing word segmentation on the user entry to be matched to obtain an initial word segmentation result;
optimizing the initial word segmentation result, wherein the optimizing comprises:
performing secondary word segmentation on the initial word segmentation result to obtain a secondary word segmentation result;
acquiring a training corpus, and calculating the co-occurrence rate of characters in the secondary word segmentation result according to the training corpus;
judging boundary words in the secondary word segmentation result;
and obtaining the word segmentation result according to the co-occurrence rate and the boundary word.
4. The method of claim 1, wherein after determining that the to-be-matched user entry matches the reference user name, further comprising:
and adding the to-be-matched user entry which is not matched with the reference user name into the reference user name.
5. A user name matching apparatus, the apparatus comprising:
the vocabulary entry acquisition module is used for acquiring the vocabulary entries of the users to be matched;
the word segmentation module is used for segmenting words of the user entry to be matched according to a preset word segmentation rule to obtain a word segmentation result;
the word block marking module is used for marking geographic word blocks, brand word blocks and category word blocks in the word segmentation result;
the similarity calculation module is used for acquiring a reference user name and calculating the similarity between the entry of the user to be matched and the reference user name according to a plurality of dimensions, wherein the dimensions correspond to the word blocks one by one;
the matching module is used for determining that the to-be-matched user entry is matched with the reference user name when the similarity is higher than a preset value;
wherein, the word block tagging module 300 comprises:
the geographic word block marking unit is used for marking a geographic word block in the word segmentation result;
the reference document acquisition unit is used for acquiring a reference document corresponding to the vocabulary entry of the user to be matched;
the inverted text frequency calculation module is used for calculating the occurrence frequency of the word segmentation result in the reference document and calculating the inverted text frequency of the word segmentation result according to the occurrence frequency;
the brand word block marking unit is used for marking the brand word block by taking the word segmentation result with the minimum inverted text frequency as the brand word block;
and the category word block marking unit is used for marking the words except the geographic word blocks and the brand word blocks in the word segmentation result as category word blocks.
6. A computer device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any one of claims 1 to 4.
7. A storage medium having stored thereon computer instructions, wherein said computer instructions when executed perform the steps of the method of any of claims 1 to 4.
CN201911053197.4A 2019-10-31 2019-10-31 User name matching method and device, computer equipment and storage medium Active CN110909532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911053197.4A CN110909532B (en) 2019-10-31 2019-10-31 User name matching method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911053197.4A CN110909532B (en) 2019-10-31 2019-10-31 User name matching method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110909532A CN110909532A (en) 2020-03-24
CN110909532B true CN110909532B (en) 2021-06-11

Family

ID=69816112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911053197.4A Active CN110909532B (en) 2019-10-31 2019-10-31 User name matching method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110909532B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429219A (en) * 2020-03-25 2020-07-17 京东数字科技控股有限公司 Data confirmation method, device, equipment and storage medium
CN114911999A (en) * 2022-05-24 2022-08-16 中国电信股份有限公司 Name matching method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320778A (en) * 2015-11-25 2016-02-10 焦点科技股份有限公司 Commodity labeling method suitable for electronic commerce Chinese website
CN107122413A (en) * 2017-03-31 2017-09-01 北京奇艺世纪科技有限公司 A kind of keyword extracting method and device based on graph model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893533B (en) * 2016-03-31 2021-05-07 北京奇艺世纪科技有限公司 Text matching method and device
CN108108373B (en) * 2016-11-25 2020-09-25 阿里巴巴集团控股有限公司 Name matching method and device
CN106909611B (en) * 2017-01-11 2020-04-03 北京众荟信息技术股份有限公司 Hotel automatic matching method based on text information extraction
CN107256212A (en) * 2017-06-21 2017-10-17 成都布林特信息技术有限公司 Chinese search word intelligence cutting method
CN109299865B (en) * 2018-09-06 2021-12-17 西南大学 Psychological evaluation system and method based on semantic analysis and information data processing terminal

Also Published As

Publication number Publication date
CN110909532A (en) 2020-03-24

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant