CN111353025B - Parallel corpus processing method and device, storage medium and computer equipment - Google Patents

Info

Publication number
CN111353025B
Authority
CN
China
Prior art keywords
word
correct
error correction
parallel corpus
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811481225.8A
Other languages
Chinese (zh)
Other versions
CN111353025A (en
Inventor
刘恒友
李辰
包祖贻
徐光伟
李林琳
司罗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811481225.8A
Publication of CN111353025A
Application granted
Publication of CN111353025B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a parallel corpus processing method, a device, a storage medium and computer equipment. The method comprises the following steps: acquiring a search data set, performing word segmentation on the search data, and counting the word frequency of each resulting segmented word; determining, according to the counted word frequencies, a candidate set of correct words in the parallel corpus; determining an error word candidate set for each correct word in the candidate set of correct words; and generating the parallel corpus according to each correct word and the error word candidate set of each correct word. The invention solves the technical problem of scarcity of parallel corpus data.

Description

Parallel corpus processing method and device, storage medium and computer equipment
Technical Field
The present invention relates to the field of data processing, and in particular, to a parallel corpus processing method, apparatus, storage medium, and computer device.
Background
In a search scenario, the search term (query) entered by the user is corrected, and the search is initiated with the query as corrected by an error correction model, which can improve the exposure rate of the searched objects.
However, training and optimizing an error correction model requires a large amount of parallel corpus data, whereas error correction parallel corpora are often scarce or entirely unavailable, and manually labeling parallel corpora is costly.
In the related art, search session (Session) data from the general search field is used to mine error correction parallel corpora. The method relies on the principle that when a user makes an error in a query and the search result does not meet expectations, the user corrects the query, so the queries before and after correction can be extracted as an error correction parallel corpus. It should be noted that this solution does not work well in all scenarios; for example, in an e-commerce search scenario, error correction parallel corpora cannot be extracted accurately in this way. In e-commerce search, most queries entered by users are product names, and most input errors arise because the user does not know the exact and complete product name. In that case, even if the search result does not meet expectations, the user is unable to correct the query, so accurate error correction parallel corpora cannot be mined from e-commerce search Session data. The above description takes error correction parallel corpora as an example; other types of parallel corpora have similar problems in the related art.
Therefore, in the related art, there is still a technical problem that parallel corpus data is scarce.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a parallel corpus processing method, a device, a storage medium and computer equipment, which are used for at least solving the technical problem of scarcity of parallel corpus data.
According to an aspect of an embodiment of the present invention, there is provided a parallel corpus processing method, including: acquiring a search data set, performing word segmentation on the search data, and counting the word frequency of each resulting segmented word; determining, according to the counted word frequencies, a candidate set of correct words in the parallel corpus; determining an error word candidate set for each correct word in the candidate set of correct words; and generating the parallel corpus according to each correct word and the error word candidate set of each correct word.
According to another aspect of the embodiment of the present invention, there is also provided a parallel corpus processing method, including: receiving a search word input by a user; when the search word is an error word in an error correction parallel corpus, acquiring the correct word corresponding to the error word, wherein the error correction parallel corpus is generated by: performing word segmentation on a search data set, determining correct words and the error word candidate sets of the correct words from the resulting segmented words, and generating the error correction parallel corpus according to the correct words and the error word candidate sets; and searching according to the correct word and feeding back a search result to the user.
According to another aspect of the embodiment of the present invention, there is also provided a parallel corpus processing apparatus, including: an acquisition module, configured to acquire a search data set, perform word segmentation on the search data, and count the word frequency of each resulting segmented word; a first determining module, configured to determine a candidate set of correct words in the parallel corpus according to the counted word frequencies; a second determining module, configured to determine an error word candidate set for each correct word in the candidate set of correct words; and a generation module, configured to generate the parallel corpus according to each correct word and the error word candidate set of each correct word.
According to another aspect of the embodiment of the present invention, there is further provided a storage medium including a stored program, wherein when the program runs, the device on which the storage medium is located is controlled to execute any one of the parallel corpus processing methods described above.
According to another aspect of an embodiment of the present invention, there is also provided a computer apparatus including: a memory and a processor, the memory storing a computer program; the processor is configured to execute the computer program stored in the memory, wherein the computer program, when run, performs any one of the parallel corpus processing methods described above.
In the embodiments of the present invention, a search data set is acquired, word segmentation is performed on the search data, and the word frequency of each resulting segmented word is counted; a candidate set of correct words in the parallel corpus is determined according to the counted word frequencies; an error word candidate set is determined for each correct word in the candidate set of correct words; and the parallel corpus is generated according to each correct word and the error word candidate set of each correct word. By mining the error correction parallel corpus from the search data set in this way, the amount of error correction parallel corpus is increased, which achieves the technical effect of improving search error correction accuracy and solves the technical problem that parallel corpus data is scarce.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 shows a block diagram of a hardware architecture of a computer terminal (or mobile device) for implementing a parallel corpus processing method;
FIG. 2 is a flow chart of a parallel corpus processing method according to embodiment 1 of the present invention;
FIG. 3 is a flow chart of another parallel corpus processing method according to embodiment 1 of the present invention;
FIG. 4 is a flow chart of a parallel corpus processing method according to a preferred embodiment of the present invention;
FIG. 5 is a flow chart of a parallel corpus processing method according to embodiment 2 of the present invention;
FIG. 6 is a schematic diagram of a parallel corpus processing apparatus according to embodiment 3 of the present invention;
FIG. 7 is a block diagram of a computer terminal according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:
Error correction parallel corpus: an error correction training data set in the format <correct word, error word>, such as <zizania, shepherd's purse>.
Search Session data: a user sometimes changes a search query several times to find the desired answer; these different queries expressing the same need form a set of search Session data.
Damerau-Levenshtein Distance: a metric for the edit distance between two character sequences. It is the minimum number of operations needed to convert one word into another, where the operations are insertion, deletion, or substitution of a single character, and transposition of two adjacent characters.
Symmetric delete spelling correction algorithm: an error correction algorithm that can quickly find, among a large number of candidate words, the correction candidates whose edit distance to the word being corrected is small.
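As a minimal sketch of the Damerau-Levenshtein definition above, the following Python function computes the restricted variant of the distance (often called optimal string alignment), in which no substring is edited more than once. The function name is illustrative and is not taken from the patent.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions of single characters and
    transpositions of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

For example, `damerau_levenshtein("ab", "ba")` is 1 because a single transposition suffices, whereas a plain Levenshtein distance would count two substitutions.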
Example 1
In accordance with an embodiment of the present invention, a method embodiment of a parallel corpus processing method is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a computer system, such as one executing a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that shown or described herein.
The method embodiment provided in Embodiment 1 of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a block diagram of the hardware structure of a computer terminal (or mobile device) for implementing a parallel corpus processing method. As shown in FIG. 1, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processors 102 may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. In addition, the computer terminal 10 may further include: a transmission module, a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination to interface).
The memory 104 may be used to store software programs and modules of application software, such as a program instruction/data storage device corresponding to the parallel corpus processing method in the embodiment of the present invention, and the processor 102 executes the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, implement the parallel corpus processing method of the application program. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module is used for receiving or transmitting data through a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through the base station to communicate with the internet. In one example, the transmission module may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
In order to solve the above-mentioned problems in the related art, in this embodiment, a parallel corpus processing method is provided. Fig. 2 is a flowchart of a parallel corpus processing method according to embodiment 1 of the present invention. As shown in fig. 2, the method comprises the steps of:
Step S202, a search data set is acquired, word segmentation is performed on the search data, and the word frequency of each resulting segmented word is counted.
As an alternative embodiment, the search data set may be acquired from a client, for example a mobile terminal or a PC, or from a server, for example a search engine server or a backup server. On the client side, the search data can be obtained directly from the search input window provided by the client. The server side may store the corresponding search data sets and is therefore also an important source. Besides these sources, the search data set may also be acquired in other ways, such as manual collection or software tools.
As an alternative embodiment, word segmentation may mean splitting a continuous character sequence and recombining it into a word sequence according to certain rules. Word segmentation methods include: segmentation based on string matching, segmentation based on understanding, and segmentation based on statistics. Segmentation based on string matching may include forward matching, reverse matching, and the like; segmentation based on understanding has a computer simulate human understanding of a sentence, performing syntactic and grammatical analysis during segmentation to resolve segmentation ambiguity; segmentation based on statistics includes maximum probability segmentation, maximum entropy segmentation, and the like.
As an alternative embodiment, before word segmentation is performed on the search data, the search terms may be preprocessed to filter out factors that adversely affect segmentation. The preprocessing includes removing punctuation marks and numbers, and may further include removing stop words, for which a stop word list compiled in advance through statistics and summarization may be used directly.
As an alternative embodiment, the word frequency may be obtained by counting the word segmentation results; specifically, it may be counted as the number of occurrences of a word within a certain period of time. Words may then be divided into classes according to actual requirements; for example, a word with a word frequency >= 100 times/day may be defined as a high-frequency word.
As an alternative embodiment, the word segmentation process may combine one or more segmentation methods to ensure both segmentation efficiency and disambiguation. For example, statistics-based segmentation generally uses a segmentation dictionary for string matching; combining word frequency statistics with string matching in this way not only improves segmentation speed and efficiency but also exploits the advantages of dictionary-free segmentation and of context-aware segmentation to identify new words and resolve ambiguity.
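Step S202 can be illustrated with a small Python sketch, assuming forward maximum matching against a dictionary as the string-matching segmenter. The function names, the `max_len` parameter, and the toy dictionary used in the note below are hypothetical and do not come from the patent; a production system would likely use a statistical segmenter, because pure dictionary matching splits out-of-vocabulary words (including misspellings) into single characters.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def preprocess(query: str) -> str:
    """Remove punctuation, whitespace, underscores and digits from a query."""
    return re.sub(r"[\W\d_]+", "", query)


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def count_word_freq(
    queries: Iterable[str],
    dictionary: Set[str],
    stop_words: Set[str] = frozenset(),
) -> Counter:
    """Segment every query and count how often each non-stop word occurs."""
    freq = Counter()
    for query in queries:
        for token in forward_max_match(preprocess(query), dictionary):
            if token not in stop_words:
                freq[token] += 1
    return freq
```

With the toy dictionary `{"苹果", "手机", "苹果手机", "壳"}`, the query `"苹果手机壳!"` is cleaned to `"苹果手机壳"` and segmented into `["苹果手机", "壳"]`.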
Step S204, according to the counted word frequency, a candidate set of correct words in the parallel corpus is determined.
As an alternative embodiment, the word frequency is the frequency with which a word occurs; in this embodiment, it is the number of times a given word (or character) occurs after word segmentation is performed on the search data set.
As an alternative embodiment, the above-mentioned error correction parallel corpus may be used to form an error correction training data set including correct words and error words, where the error correction training data set may be in the format <correct word, error word>, e.g., <zizania, shepherd's purse>.
As an alternative embodiment, the high-frequency words, determined from the word frequencies counted after word segmentation, are combined into the candidate set of correct words.
Step S206, determining an error word candidate set for each correct word in the candidate set of correct words.
As an alternative embodiment, each correct word has a corresponding error word candidate set, which may include a plurality of error words. It should be noted that the words in an error word candidate set have a certain similarity to the corresponding correct word but also differ from it clearly; that is, an error word does not faithfully convey the intended meaning.
As an alternative embodiment, the error word candidate set of each correct word in the candidate set of correct words may be determined, for example, by selecting, for each correct word, low-frequency words similar to the correct word to form the error word candidate set of that correct word.
As an alternative embodiment, after the candidate set of correct words is generated, one or more words with high similarity to a correct word are selected from the segmented words of the search data set as the error word candidate set of that correct word.
Step S208, generating parallel corpus according to each correct word and the error word candidate set of each correct word.
As an alternative embodiment, the parallel corpus may be any of several types of parallel corpus that meet different requirements. For example, it may be a parallel corpus of two expressions with maximal semantic similarity, used for determining similar words; an error correction parallel corpus used for correcting error words; or a parallel corpus serving other requirements or meanings. It should be noted that this application mainly uses the error correction parallel corpus for illustration.
As an alternative embodiment, since each error correction parallel corpus entry has the format <correct word, error word>, each correct word in the error correction parallel corpus may have one or more corresponding error words.
As an optional embodiment, when generating the error correction parallel corpus, the plurality of error words corresponding to each correct word form an error word candidate set, and the words closest to the correct word are further selected from that set, so that the error correction parallel corpus is generated from each correct word and its corresponding error words.
As an alternative embodiment, the parallel corpus may be generated from each correct word and its error word candidate set as follows: to make the generated error correction parallel corpus more accurate, in addition to generating it directly from each correct word and the error word candidate set of that correct word, each correct word and its error word candidate set may be combined with other information; for example, for each correct word and the corresponding error word candidate set, the error correction parallel corpus is generated according to word frequency, context information, similarity, and the like.
In the embodiments of the present invention, a search data set is acquired, word segmentation is performed on the search data, and the word frequency of each resulting segmented word is counted; a candidate set of correct words in the parallel corpus is determined according to the counted word frequencies; an error word candidate set is determined for each correct word in the candidate set of correct words; and the parallel corpus is generated according to each correct word and the error word candidate set of each correct word. By mining the error correction parallel corpus from the search data set in this way, the amount of error correction parallel corpus is increased, which achieves the technical effect of improving search error correction accuracy and solves the technical problem that parallel corpus data is scarce.
As an alternative embodiment, high-frequency words whose word frequency exceeds a predetermined word frequency threshold are determined according to the counted word frequencies, and the candidate set of correct words in the parallel corpus is generated from the high-frequency words.
It should be noted that after the word frequencies have been counted, each word frequency needs to be compared with the predetermined word frequency threshold, which may be set according to the actual situation; for example, a word whose word frequency exceeds the threshold is defined as a high-frequency word. If the requirements on the candidate set of correct words in the error correction parallel corpus are stricter, the predetermined word frequency threshold may be set higher, which narrows the candidate set of correct words and improves the accuracy of the error correction parallel corpus to some extent. However, the predetermined word frequency threshold must also be kept within a reasonable range, since a threshold that is too high or too low has a considerable impact on the application of the error correction parallel corpus.
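The thresholding described above might look like the following Python sketch. The function name and the `min_freq` parameter are illustrative assumptions; the patent does not prescribe a particular threshold value.

```python
from typing import Mapping, Set


def correct_word_candidates(word_freq: Mapping[str, int], min_freq: int) -> Set[str]:
    """Return the words whose frequency exceeds the predetermined threshold."""
    return {word for word, count in word_freq.items() if count > min_freq}
```

Raising `min_freq` shrinks the returned set, which matches the trade-off described above: fewer but more reliable correct-word candidates.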
As an alternative embodiment, for each correct word in the candidate set of correct words, a predetermined number of error words having the greatest similarity to the correct word are selected, and the error word candidate set of the correct word is generated from these error words.
The candidate set of correct words includes one or more correct words, each of which may be among the most frequently occurring words in the search data set. In practice, people's knowledge differs; for example, the same word may be misremembered in several different ways, so that one or more similar error words may appear for a given correct word. The similarity may be based on pinyin, Wubi (five-stroke) codes, or commonly confused visually similar characters. The most intuitive indicator of similarity may be frequency of use: for example, the closer the frequency of use of an error word is to that of the correct word, the higher the similarity between them. In general, there are many more error words than correct words, so a predetermined number of error words with the greatest similarity to the correct word can be selected to generate the error word candidate set of the correct word.
As an alternative embodiment, the similarity between the correct word and other words is determined in at least one of the following ways: determining the similarity according to the edit distance between the pinyin of the correct word and that of other words; determining the similarity according to the edit distance between the Wubi (five-stroke) codes of the correct word and those of other words; and determining the similarity between the correct word and other words according to a preset lookup table of visually similar characters.
The similarity reflects the correlation between the correct word and other words: the greater the correlation, the higher the similarity; the smaller the correlation, the lower the similarity. The similarity between the correct word and other words is based on one of the following: pinyin, Wubi codes, and visually similar characters. For example, the correct word and another word may be homophones written with different characters, or their pinyin may differ only in easily confused letters, in which case their pinyin edit distance is small. Likewise, under Wubi coding, easily confused strokes make the codes of the correct word and the other word similar. In addition, the correct word and the other word may differ only slightly in appearance, in which case they are visually similar characters and are treated as similar.
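The pinyin-based variant of this selection can be sketched in Python as follows. The sketch assumes a tiny hand-written pinyin table (`TOY_PINYIN`) in place of a real pinyin library, uses plain Levenshtein distance, and assumes that `vocabulary` already holds the low-frequency words to be considered; all names are hypothetical. A Wubi-based variant would only need a different `encode` function.

```python
from typing import Callable, Iterable, List

# Hypothetical toy table; a real system would use a full pinyin dictionary.
TOY_PINYIN = {
    "苹": "ping", "平": "ping", "果": "guo", "裹": "guo",
    "香": "xiang", "蕉": "jiao",
}


def to_pinyin(word: str) -> str:
    """Concatenate the toy pinyin of each character (unknown characters kept as-is)."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in word)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def error_word_candidates(
    correct: str,
    vocabulary: Iterable[str],
    encode: Callable[[str], str] = to_pinyin,
    top_k: int = 2,
) -> List[str]:
    """Pick the top_k words whose encoding is closest to that of the correct word."""
    target = encode(correct)
    scored = sorted(
        (edit_distance(target, encode(word)), word)
        for word in vocabulary
        if word != correct
    )
    return [word for _, word in scored[:top_k]]
```

Here a smaller edit distance is treated as a higher similarity, so `top_k` plays the role of the predetermined number of error words.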
As an alternative embodiment, a symmetric delete spelling correction algorithm is used to determine the similarity between the correct word and the other words.
The symmetric delete spelling correction algorithm speeds up the selection of correction candidates within a small edit distance. For example, using it to accelerate the computation of similarity between correct words and other words reduces the time complexity from O(n^2) to a constant level.
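The idea behind symmetric delete lookup can be sketched in Python as below: every vocabulary word is indexed under all strings obtained by deleting up to `max_edits` characters, and a query is matched by generating its own deletions and looking them up. This is a simplified illustration with hypothetical names, not the patent's implementation; the returned candidates are a superset and would normally be verified with a real edit distance afterwards.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def deletes(word: str, max_edits: int) -> Set[str]:
    """All strings obtained by deleting up to max_edits characters (word included)."""
    variants = {word}
    for k in range(1, min(max_edits, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(vocabulary: Iterable[str], max_edits: int = 1) -> Dict[str, Set[str]]:
    """Map every deletion variant to the vocabulary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, max_edits):
            index[variant].add(word)
    return index


def lookup(word: str, index: Dict[str, Set[str]], max_edits: int = 1) -> Set[str]:
    """Return vocabulary words sharing a deletion variant with the query word."""
    found = set()
    for variant in deletes(word, max_edits):
        found |= index.get(variant, set())
    found.discard(word)  # a word is not its own error candidate
    return found
```

Because each lookup touches only the deletion variants of the query rather than every vocabulary entry, its cost does not grow with vocabulary size, which is the speed-up the paragraph above refers to.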
As an alternative embodiment, where the parallel corpus includes an error correction parallel corpus, generating the parallel corpus from each correct word and the error word candidate set of each correct word may include: for each correct word, generating a plurality of candidate error correction parallel corpora from the correct word and the plurality of error words in its error word candidate set; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora.
The error correction parallel corpus is screened from the plurality of candidate error correction parallel corpora according to different screening conditions; in this way, a better error correction parallel corpus can be obtained, which improves the accuracy of subsequent search error correction.
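As a small illustration of the candidate generation step, the following Python sketch expands each correct word and its error word candidate set into <correct word, error word> pairs. The function name and the dictionary-based input shape are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def candidate_pairs(error_candidates: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Expand {correct word: [error words]} into (correct word, error word) pairs."""
    return [
        (correct, error)
        for correct, errors in error_candidates.items()
        for error in errors
    ]
```

Each resulting pair is one candidate error correction parallel corpus entry that the subsequent screening may keep or discard.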
FIG. 3 is a flowchart of another parallel corpus processing method according to embodiment 1 of the present invention, as shown in FIG. 3, as an alternative embodiment, the error correcting parallel corpus is selected from a plurality of candidate error correcting parallel corpora by at least one of:
Step S302, determining a predetermined screening condition according to the word frequency of the correct word and the word frequency of the error word that forms a candidate error correction parallel corpus with the correct word; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora according to the predetermined screening condition;
as an alternative embodiment, the predetermined screening condition may be given by a specific screening rule, for example: word frequency of the correct word >= 10 × word frequency of the error candidate word. Once such a screening rule has been formulated, the error correction parallel corpus is screened from the plurality of candidate error correction parallel corpora accordingly.
Step S304, for each candidate error correction parallel corpus, respectively acquiring the context of the correct word and of the error word in the candidate error correction parallel corpus; judging whether the candidate error correction parallel corpus is a noise corpus according to the context; and obtaining the error correction parallel corpus by deleting the noise corpora from the plurality of candidate error correction parallel corpora;
as an alternative embodiment, the context consists of the word before and the word after the correct word or the error word. The context is used to judge whether a candidate error correction parallel corpus is a noise corpus; if it is, it is deleted, and the remaining corpora form the final error correction parallel corpus. Specifically, the word frequencies of the correct word and of the corresponding error candidate word are counted in each context; if Freq(correct word) < k × Freq(error candidate word), where k is an integer such as k = 10, the candidate is regarded as noise and is removed from the parallel corpus.
Step S306, for each candidate error correction parallel corpus, determining, by the D-L editing algorithm, the minimum number of operations needed to convert the correct word in the candidate error correction parallel corpus into the error word, and judging whether the candidate error correction parallel corpus is a noise corpus according to the minimum number of operations; and obtaining the error correction parallel corpus by deleting the noise corpora from the plurality of candidate error correction parallel corpora.
As an alternative embodiment, the D-L editing algorithm is the Damerau-Levenshtein distance algorithm, which measures the edit distance between two character sequences. The distance is the minimum number of operations needed to convert one word into the other, where the allowed operations are insertion, deletion, or substitution of a single character and transposition of two adjacent characters.
As an alternative embodiment, the D-L editing algorithm yields the minimum number of operations needed to convert the correct word in a candidate error correction parallel corpus into the error word; whether the candidate is a noise corpus is then judged from this number, and the noise corpora are deleted to obtain the error correction parallel corpus.
As an alternative embodiment, the word frequency of the correct word in the context and the word frequency of the error word in the context are counted respectively for the candidate error correction parallel corpus; and when the word frequency of the correct word in the context is smaller than a predetermined multiple of the word frequency of the error word in the context, the candidate error correction parallel corpus is determined to be a noise corpus.
After the word frequencies of the correct word and the error word in the context are obtained, it is judged whether the word frequency of the correct word is smaller than the predetermined multiple of the word frequency of the error word. If the word frequency of the correct word is less than k times the word frequency of the error word, where k is an integer, the candidate error correction parallel corpus is regarded as a noise corpus; otherwise, it is not a noise corpus.
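The context-based check described above can be sketched as follows. This is an illustrative reading, not the method as claimed: the text does not say how the counts from different contexts are combined, so the sketch assumes that a pair is noise as soon as the rule Freq(correct word) < k × Freq(error word) holds in any context in which the error word occurs. The function names, the boundary markers `<s>` and `</s>`, and the sample queries are likewise assumptions.

```python
from collections import Counter


def context_counts(queries, target):
    """Count the (previous word, next word) contexts in which target occurs.

    queries is a list of already segmented queries (lists of words).
    """
    counts = Counter()
    for tokens in queries:
        for i, token in enumerate(tokens):
            if token == target:
                prev_word = tokens[i - 1] if i > 0 else "<s>"
                next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
                counts[(prev_word, next_word)] += 1
    return counts


def is_noise(queries, correct, wrong, k=10):
    """Assumed aggregation: the pair is noise if, in any context where the
    error word occurs, the correct word is seen fewer than k times as often."""
    correct_ctx = context_counts(queries, correct)
    wrong_ctx = context_counts(queries, wrong)
    return any(correct_ctx[ctx] < k * freq for ctx, freq in wrong_ctx.items())


# Hypothetical segmented queries, for illustration only.
queries = [["fresh", "apple", "juice"]] * 20 + [["fresh", "aple", "juice"]]
keep_pair = not is_noise(queries, "apple", "aple", k=10)
```

With this aggregation, an error word that appears in a context where the correct word never appears is treated as noise, on the reasoning that it is probably a different legitimate word rather than a misspelling.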
As an alternative embodiment, an operation-count threshold for judging that a candidate error correction parallel corpus is a noise corpus is determined; and when the minimum number of operations is larger than the threshold, the candidate error correction parallel corpus is determined to be a noise corpus.
The minimum number of operations is obtained by the D-L editing algorithm. For example, converting kitten into sitting proceeds as follows: first, kitten becomes sitten (s replaces k); second, sitten becomes sittin (i replaces e); third, sittin becomes sitting (g is inserted at the end). The minimum number of operations for this conversion is therefore 3. The minimum number of operations is then compared with the threshold to determine whether the candidate error correction parallel corpus is a noise corpus: if it is greater than the threshold, the candidate is a noise corpus; otherwise, it is not.
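The distance computation and the threshold comparison can be illustrated with a short sketch. It implements the restricted variant of the Damerau-Levenshtein distance (often called optimal string alignment), which is one common way to compute it; the text does not state which variant is used. The function names and the default `threshold=2` are assumptions chosen for the example.

```python
def dl_distance(source, target):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    minimum number of insertions, deletions, substitutions, and
    transpositions of adjacent characters turning source into target."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def is_noise_by_distance(correct, wrong, threshold=2):
    """A pair is noise when the distance exceeds the (assumed) threshold."""
    return dl_distance(correct, wrong) > threshold


distance = dl_distance("kitten", "sitting")  # 3, as in the example above
```

Because Python strings are sequences of Unicode code points, the same function operates at the Chinese character level when given Chinese words.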
In combination with the foregoing embodiments and preferred embodiments, a complete preferred embodiment is provided below. FIG. 4 is a flowchart of a parallel corpus processing method according to a preferred embodiment of the present invention. As shown in FIG. 4, the preferred embodiment includes the following:
(1) Pull the e-commerce search query data set, segment the queries into words, and count the word frequency Freq(word).
(2) Generate the correct word candidate set: high-frequency words (e.g., word frequency >= 100 times/day) are selected to form the correct word candidate set.
(3) Generate an error word candidate set for each correct word. For each correct word generated in (2), the n words (e.g., n = 10) with the greatest similarity to the correct word are selected from the word set obtained by segmentation in (1) to form the error word candidate set of that correct word.
The similarity calculation may use one or a combination of the following: a. calculating the similarity according to the edit distance between the pinyin of the two words; b. calculating the similarity according to the edit distance between the Wubi (five-stroke) codes of the two words; c. generating the error word candidate set according to a commonly used table of visually similar characters downloaded from the Internet.
Since each correct word must be compared with all the words obtained by segmentation in (1), the time complexity is relatively high; in the preferred embodiment, the symmetric delete spelling correction algorithm is used to accelerate the similarity calculation.
(4) Coarse screening of the parallel corpus. The error correction parallel corpus <correct word, error candidate word> is screened with the rule: Freq(correct word) >= 10 × Freq(error candidate word).
(5) Filter noise based on context information, as follows: a. Context information mining. For each parallel corpus <correct word, error candidate word>, the contexts in which the two words occur are mined respectively; for example, the words before and after the current word form its context. b. Noise filtering. The word frequencies of the correct word and of the corresponding error candidate word are counted in each context; if Freq(correct word) < k × Freq(error candidate word), the pair is regarded as noise and is removed from the parallel corpus, where k is an integer such as k = 10.
(6) Parallel corpus post-processing: for the parallel corpus generated in (5), pairs that are far apart are filtered out using the Damerau-Levenshtein distance at the Chinese character level; finally, the data of the correct words and error candidate words in the parallel corpus is filtered.
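Steps (1), (2), and (4) above are simple frequency computations and can be sketched as follows. The sketch assumes that the queries have already been segmented into word lists by some word segmenter, which is not shown; the function names, the sample data, and the way candidate pairs are supplied are hypothetical, while the thresholds 100 and 10 are the example values given in the text.

```python
from collections import Counter


def count_word_freq(segmented_queries):
    """Step (1): count Freq(word) over already segmented queries."""
    return Counter(word for tokens in segmented_queries for word in tokens)


def select_correct_words(freq, min_freq=100):
    """Step (2): keep high-frequency words as the correct word candidate set."""
    return {word for word, count in freq.items() if count >= min_freq}


def coarse_screen(pairs, freq, ratio=10):
    """Step (4): keep <correct word, error candidate word> pairs that satisfy
    Freq(correct word) >= ratio * Freq(error candidate word)."""
    return [(correct, wrong) for correct, wrong in pairs
            if freq[correct] >= ratio * freq[wrong]]


# Hypothetical segmented query log, for illustration only.
segmented = ([["apple", "juice"]] * 120
             + [["aple"]] * 5
             + [["appel"]] * 30)
freq = count_word_freq(segmented)
correct_words = select_correct_words(freq, min_freq=100)
screened = coarse_screen([("apple", "aple"), ("apple", "appel")], freq)
```

In this toy data the pair with "appel" is dropped because 120 is less than 10 × 30, which reflects the assumption that a genuine misspelling is searched far less often than its correct form.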
Compared with the related-art method of mining the error correction parallel corpus from Session search data, the above preferred embodiment adapts to a new scenario (for example, a new retail e-commerce search scenario) and mines the parallel corpus in a new way. This improves the accuracy of error correction, enables automatic mining of error correction parallel corpus data, expands the error correction training data, and improves the error correction effect.
For example, the new retail e-commerce search scenario differs from a general search scenario: the queries entered by users are mostly very short commodity names, and most erroneous inputs occur because the user does not know the exact, complete commodity name, so the user is unlikely to correct the input afterwards. As a result, <correct query, error candidate query> pairs rarely appear in Session search data, and the parallel corpus cannot be mined from the Session log in the new scenario. In the preferred embodiment, the <correct query, error candidate query> parallel corpus candidates are instead generated according to the phonetic and glyph similarity between words, rather than mined from search Sessions; noise is then filtered out according to search frequency, context information, query length, and the Damerau-Levenshtein distance between the correct query and the error query, finally yielding a high-quality error correction parallel corpus.
By mining the error correction parallel corpus from search data, this embodiment can be applied to the new scenario of e-commerce search error correction. In this preferred embodiment, the following processing is employed: extracting the error correction parallel corpus based on the fact that correct words are searched far more often than error words; accelerating the selection of error correction candidate words within a small edit distance by means of the symmetric delete spelling correction algorithm; filtering noise from the parallel corpus according to context information; filtering noise from the parallel corpus according to the Damerau-Levenshtein distance; and generating error candidates according to glyph similarity, pinyin similarity, and the like. Together these effectively improve the quality of the mined error correction parallel corpus.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method of the various embodiments of the present invention.
Example 2
According to another aspect of the embodiments of the present invention, a parallel corpus processing method is further provided. FIG. 5 is a flowchart of a parallel corpus processing method according to Embodiment 2 of the present invention. As shown in FIG. 5, the method includes:
step S502, receiving search words input by a user;
step S504, when the search word is an error word in the error correction parallel corpus, obtaining the correct word corresponding to the error word, wherein the error correction parallel corpus is generated by: segmenting the search data set into words, determining the correct words and the error word candidate set of each correct word from the obtained words, and generating the error correction parallel corpus according to the correct words and the error word candidate sets;
step S506, searching is carried out according to the correct words, and search results are fed back to the user.
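Steps S502 to S506 amount to a lookup followed by a search, which can be sketched as follows. The sketch assumes the error correction parallel corpus is available as a list of (correct word, error word) pairs and that `search_fn` is a hypothetical stand-in for the search backend; when an error word is paired with several correct words, keeping the first pair is an arbitrary choice made for the example.

```python
def build_correction_table(parallel_corpus):
    """Map each error word to its correct word.

    parallel_corpus is an iterable of (correct word, error word) pairs.
    If an error word appears more than once, the first pair wins
    (an assumption of this sketch).
    """
    table = {}
    for correct, wrong in parallel_corpus:
        table.setdefault(wrong, correct)
    return table


def correct_query(query, table):
    """Step S504: return the correct word if the query is a known error
    word, otherwise return the query unchanged."""
    return table.get(query, query)


def search_with_correction(query, table, search_fn):
    """Steps S502 to S506: correct the received query, then search with it.

    search_fn is a hypothetical callable standing in for the search backend.
    """
    return search_fn(correct_query(query, table))


# Hypothetical corpus and backend, for illustration only.
table = build_correction_table([("apple", "aple"), ("apple", "appel")])
results = search_with_correction("aple", table, lambda q: ["result for " + q])
```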
In the embodiment of the invention, the search word is corrected using the error correction parallel corpus generated from the search data set: when a user inputs an error word contained in the error correction parallel corpus, the search is performed with the correct word corresponding to that error word, and the correct search results are fed back to the user. Mining the error correction parallel corpus from the search data set enlarges the error correction parallel corpus, which achieves the technical effect of improving the accuracy of search error correction and solves the technical problem of scarce parallel corpus data.
As an alternative embodiment, before receiving the search term input by the user, the method further comprises: acquiring search words input by a plurality of users within a preset time period, and generating a search word log; a search dataset is generated from the search term log.
The search data set may be obtained from the search log generated by recording the search words input by a plurality of users within the predetermined time period. The predetermined time period may be chosen flexibly according to the specific search object. For example, when the naming of the search object is relatively unique, a shorter period may be chosen, because the search words would not change much even over a longer period; when the naming of the search object is more varied, a longer period may be chosen, because a longer period yields more reliable statistics. In addition, the plurality of users may be users within a certain geographic range, because users in different regions may name or refer to search objects differently; alternatively, they may be users of a certain occupation, because a given search object may be relevant mainly to a certain occupational field, and users of that occupation may to some extent reflect the standard naming of the search object.
Example 3
According to an embodiment of the present invention, an apparatus for processing an error correction parallel corpus is further provided. FIG. 6 is a schematic diagram of an apparatus for processing an error correction parallel corpus according to Embodiment 3 of the present invention. As shown in FIG. 6, the apparatus includes an obtaining module 602, a first determining module 604, a second determining module 606, and a generating module 608, which are described in detail below.
The obtaining module 602 is configured to obtain a search data set, segment the search data into words, and count the word frequency of each word obtained; the first determining module 604 is connected to the obtaining module 602 and is configured to determine the correct word candidate set of the parallel corpus according to the counted word frequencies; the second determining module 606 is connected to the first determining module 604 and is configured to determine an error word candidate set for each correct word in the correct word candidate set; the generating module 608 is connected to the second determining module 606 and is configured to generate the parallel corpus according to each correct word and the error word candidate set of each correct word.
In the embodiment of the invention, the error correction parallel corpus processing device is adopted, and the error correction parallel corpus is mined through the search data set, so that the purpose of increasing the error correction parallel corpus is achieved, the technical effect of improving the error correction accuracy of the search is realized, and the technical problem of scarcity of parallel corpus data is further solved.
Here, the obtaining module 602, the first determining module 604, the second determining module 606, and the generating module 608 correspond to steps S202 to S208 in Embodiment 1; the examples and application scenarios implemented by the four modules are the same as those of the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should be noted that the above modules may run as part of the apparatus in the computer terminal 10 provided in Embodiment 1.
Example 4
Embodiments of the present invention may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in the present embodiment, the above-described computer terminal may be replaced with a terminal device such as a mobile terminal.
Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
In this embodiment, the computer terminal may execute program code for the following steps of the parallel corpus processing method: obtaining a search data set, segmenting the search data into words, and counting the word frequency of each word obtained; determining the correct word candidate set of the parallel corpus according to the counted word frequencies; determining an error word candidate set for each correct word in the correct word candidate set; and generating the parallel corpus according to each correct word and the error word candidate set of each correct word.
Alternatively, fig. 7 is a block diagram of a computer terminal according to an embodiment of the present invention. As shown in fig. 7, the computer terminal 10 may include: one or more (only one shown) processors 702, memory 704, and peripheral interfaces.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the parallel corpus processing method and apparatus in the embodiments of the present invention, and the processor executes the software programs and modules stored in the memory, thereby executing various functional applications and data processing, that is, implementing the parallel corpus processing method described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located with respect to the processor, the remote memory being connectable to the terminal through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: obtaining a search data set, segmenting the search data into words, and counting the word frequency of each word obtained; determining the correct word candidate set of the parallel corpus according to the counted word frequencies; determining an error word candidate set for each correct word in the correct word candidate set; and generating the parallel corpus according to each correct word and the error word candidate set of each correct word.
Optionally, the above processor may further execute program code for: determining high-frequency words with word frequency exceeding a preset word frequency threshold according to the counted word frequency; and generating a candidate set of correct words in the parallel corpus according to the high-frequency words.
Optionally, the above processor may further execute program code for: selecting a preset number of error words with the maximum similarity with the correct words aiming at each correct word in the candidate set of the correct words; and generating an error word candidate set of the correct word according to the error word.
Optionally, the above processor may further execute program code for: determining the similarity between the correct word and other words in at least one of the following ways: determining the similarity according to the edit distance between the pinyin of the correct word and that of the other words; determining the similarity according to the edit distance between the Wubi (five-stroke) codes of the correct word and those of the other words; and determining the similarity between the correct word and the other words according to a preset comparison table of visually similar characters.
Optionally, the above processor may further execute program code for: determining the similarity between the correct word and other words using a symmetric delete spelling correction algorithm.
Optionally, the above processor may further execute program code for: in the case where the parallel corpus includes an error correction parallel corpus, generating the parallel corpus according to each correct word and the error word candidate set of each correct word includes: for each correct word, generating a plurality of candidate error correction parallel corpora according to the correct word and the plurality of error words in its error word candidate set; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora.
Optionally, the above processor may further execute program code for screening the error correction parallel corpus in at least one of the following ways: determining a predetermined screening condition according to the word frequency of the correct word and the word frequency of the error word that forms a candidate error correction parallel corpus with the correct word, and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora according to the predetermined screening condition; for each candidate error correction parallel corpus, respectively acquiring the context of the correct word and of the error word in the candidate error correction parallel corpus, judging whether the candidate error correction parallel corpus is a noise corpus according to the context, and obtaining the error correction parallel corpus by deleting the noise corpora from the plurality of candidate error correction parallel corpora; for each candidate error correction parallel corpus, determining, by the D-L editing algorithm, the minimum number of operations needed to convert the correct word in the candidate error correction parallel corpus into the error word, judging whether the candidate error correction parallel corpus is a noise corpus according to the minimum number of operations, and obtaining the error correction parallel corpus by deleting the noise corpora from the plurality of candidate error correction parallel corpora.
Optionally, the above processor may further execute program code for: respectively counting word frequencies of correct words in the candidate error correction parallel corpus and word frequencies of wrong words in the context; and under the condition that the word frequency of the correct word in the context is smaller than the preset multiple of the word frequency of the wrong word in the context, determining the candidate error correction parallel corpus as the noise corpus.
Optionally, the above processor may further execute program code for: determining an operation-count threshold for judging that a candidate error correction parallel corpus is a noise corpus; and when the minimum number of operations is larger than the threshold, determining the candidate error correction parallel corpus to be a noise corpus.
The processor may also call the information stored in the memory and the application program through the transmission device to execute the program code of the following steps: receiving search words input by a user; under the condition that the search word is an error word in error correction parallel corpus, acquiring the correct word corresponding to the error word, wherein the error correction parallel corpus is generated by the following steps: performing word segmentation on the search data set, determining correct words from the obtained word segmentation, and error word candidate sets of the correct words, and generating error correction parallel corpus according to the correct words and the error word candidate sets; searching is carried out according to the correct words, and search results are fed back to the user.
Optionally, the above processor may further execute program code for: acquiring search words input by a plurality of users within a preset time period, and generating a search word log; a search dataset is generated from the search term log.
In the embodiment of the invention, a search data set is obtained, the search data is segmented into words, and the word frequency of each word obtained is counted; the correct word candidate set of the parallel corpus is determined according to the counted word frequencies; an error word candidate set is determined for each correct word in the correct word candidate set; and the parallel corpus is generated according to each correct word and the error word candidate set of each correct word. By mining the error correction parallel corpus from the search data set in this way, the error correction parallel corpus is enlarged, which achieves the technical effect of improving the accuracy of search error correction and solves the technical problem of scarce parallel corpus data.
It will be appreciated by those skilled in the art that the configuration shown in FIG. 7 is only illustrative, and the computer terminal may also be a smart phone (such as an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), a PAD, or the like. FIG. 7 does not limit the structure of the above electronic device. For example, the computer terminal 10 may also include more or fewer components (e.g., network interfaces, display devices, etc.) than shown in FIG. 7, or have a different configuration than shown in FIG. 7.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing the hardware associated with a terminal device; the program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Example 5
The embodiment of the invention also provides a storage medium. Alternatively, in this embodiment, the storage medium may be used to store the program code executed by the parallel corpus processing method provided in the first embodiment.
Alternatively, in this embodiment, the storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: obtaining a search data set, segmenting the search data into words, and counting the word frequency of each word obtained; determining the correct word candidate set of the parallel corpus according to the counted word frequencies; determining an error word candidate set for each correct word in the correct word candidate set; and generating the parallel corpus according to each correct word and the error word candidate set of each correct word.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining high-frequency words with word frequency exceeding a preset word frequency threshold according to the counted word frequency; and generating a candidate set of correct words in the parallel corpus according to the high-frequency words.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: selecting a preset number of error words with the maximum similarity with the correct words aiming at each correct word in the candidate set of the correct words; and generating an error word candidate set of the correct word according to the error word.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining the similarity between the correct word and other words in at least one of the following ways: determining the similarity according to the edit distance between the pinyin of the correct word and that of the other words; determining the similarity according to the edit distance between the Wubi (five-stroke) codes of the correct word and those of the other words; and determining the similarity between the correct word and the other words according to a preset comparison table of visually similar characters.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: determining the similarity between the correct word and other words using a symmetric delete spelling correction algorithm.
Alternatively, in the present embodiment, the storage medium is configured to store program code for performing the steps of: in the case where the parallel corpus includes an error correction parallel corpus, generating the parallel corpus according to each correct word and the error word candidate set of each correct word includes: for each correct word, generating a plurality of candidate error correction parallel corpora according to the correct word and the plurality of error words in its error word candidate set; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora.
Alternatively, in the present embodiment, the storage medium is configured to store program code for screening the error correction parallel corpus in at least one of the following ways: determining a predetermined screening condition according to the word frequency of the correct word and the word frequency of the error word that forms a candidate error correction parallel corpus with the correct word, and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora according to the predetermined screening condition; for each candidate error correction parallel corpus, respectively acquiring the context of the correct word and of the error word in the candidate error correction parallel corpus, judging whether the candidate error correction parallel corpus is a noise corpus according to the context, and obtaining the error correction parallel corpus by deleting the noise corpora from the plurality of candidate error correction parallel corpora; for each candidate error correction parallel corpus, determining, by the D-L editing algorithm, the minimum number of operations needed to convert the correct word in the candidate error correction parallel corpus into the error word, judging whether the candidate error correction parallel corpus is a noise corpus according to the minimum number of operations, and obtaining the error correction parallel corpus by deleting the noise corpora from the plurality of candidate error correction parallel corpora.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: respectively counting the word frequency of the correct word of the candidate error correction parallel corpus in the context environment and the word frequency of the error word in the context environment; and in the case that the word frequency of the correct word in the context environment is less than a preset multiple of the word frequency of the error word in the context environment, determining that the candidate error correction parallel corpus is a noise corpus.
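As a non-limiting illustration, the context-based noise test above might be sketched in Python as follows. The sketch assumes that a "context environment" is the segmented query with the target word masked out, and that the correct word's frequency is counted only in contexts where the error word also occurs; both are assumptions of this example, not definitions given in this embodiment. The names `context_counts`, `is_noise`, and `multiple` are hypothetical.

```python
from collections import Counter

def context_counts(queries, word):
    """Count the contexts (the query with `word` replaced by a slot) in which `word` occurs."""
    counts = Counter()
    for tokens in queries:
        for i, tok in enumerate(tokens):
            if tok == word:
                context = tuple(tokens[:i]) + ("<slot>",) + tuple(tokens[i + 1:])
                counts[context] += 1
    return counts

def is_noise(queries, correct, wrong, multiple=2.0):
    """Treat the (correct, wrong) pair as noise when the correct word is not
    at least `multiple` times as frequent as the wrong word in the wrong word's contexts."""
    correct_ctx = context_counts(queries, correct)
    wrong_ctx = context_counts(queries, wrong)
    correct_freq = sum(correct_ctx[c] for c in wrong_ctx)  # missing contexts count as 0
    wrong_freq = sum(wrong_ctx.values())
    return correct_freq < multiple * wrong_freq
```

Under these assumptions, a pair such as ("apple", "maple") would tend to be flagged as noise, because "maple" occurs mostly in contexts where "apple" is rare, which suggests it is a legitimate word rather than a misspelling.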
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining a threshold on the number of operations for judging that the candidate error correction parallel corpus is a noise corpus; and in the case that the minimum number of operations is greater than the threshold, determining that the candidate error correction parallel corpus is a noise corpus.
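As a non-limiting illustration, the operation-count test above might be sketched as follows. The sketch uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, which counts insertions, deletions, substitutions, and transpositions of adjacent characters; the choice of this variant, the function names, and the default threshold `max_ops=2` are assumptions of this example.

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def is_noise_by_distance(correct, wrong, max_ops=2):
    """Flag the pair as noise when more than `max_ops` operations are needed."""
    return dl_distance(correct, wrong) > max_ops
```

For Chinese words, the same function could be applied to the pinyin or Wubi encoding of each word instead of the characters themselves; which representation is compared is a design choice left open here.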
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: receiving a search word input by a user; in the case that the search word is an error word in the error correction parallel corpus, acquiring the correct word corresponding to the error word, wherein the error correction parallel corpus is generated as follows: performing word segmentation on a search data set, determining correct words and the error word candidate sets of the correct words from the words obtained by the segmentation, and generating the error correction parallel corpus according to the correct words and the error word candidate sets; and searching according to the correct word and feeding back the search result to the user.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring search words input by a plurality of users within a preset time period, and generating a search word log; and generating the search data set according to the search word log.
Example 6
According to another aspect of the embodiments of the present invention, a computer device is further provided, including: a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program stored in the memory, the computer program performing the following steps when run: acquiring a search data set, performing word segmentation on the search data, and counting the word frequency of each word obtained by the segmentation; determining a candidate set of correct words in the parallel corpus according to the counted word frequencies; determining an error word candidate set for each correct word in the candidate set of correct words; and generating the parallel corpus according to each correct word and the error word candidate set of each correct word.
Optionally, in this embodiment, the computer program stored in the memory executed by the processor may further perform the following steps: determining high-frequency words with word frequency exceeding a preset word frequency threshold according to the counted word frequency; and generating a candidate set of correct words in the parallel corpus according to the high-frequency words.
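As a non-limiting illustration, the word frequency counting and high-frequency selection described above might be sketched as follows. The sketch assumes the search data has already been segmented into lists of words by some word segmenter; the function name `correct_word_candidates` and the parameter `min_freq` are hypothetical.

```python
from collections import Counter

def correct_word_candidates(segmented_queries, min_freq):
    """Return (candidate set of correct words, word frequency table).

    A word becomes a correct-word candidate when its frequency exceeds `min_freq`.
    """
    freq = Counter(tok for tokens in segmented_queries for tok in tokens)
    candidates = {word for word, n in freq.items() if n > min_freq}
    return candidates, freq
```

The intuition is that words typed frequently by many users are more likely to be spelled correctly, so a frequency threshold gives a cheap first approximation of the correct vocabulary.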
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: for each correct word in the candidate set of correct words, selecting a preset number of error words having the greatest similarity to the correct word; and generating the error word candidate set of the correct word according to the selected error words.
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: determining the similarity between the correct word and other words in at least one of the following ways: determining the similarity according to the edit distance between the pinyin of the correct word and the pinyin of the other words; determining the similarity according to the edit distance between the Wubi (five-stroke) codes of the correct word and those of the other words; and determining the similarity between the correct word and the other words according to a preset near-word comparison table.
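As a non-limiting illustration, selecting the preset number of most similar words by pinyin edit distance might be sketched as follows. The `PINYIN` table is a small hypothetical lookup standing in for a real pinyin converter, plain Levenshtein distance is used as the edit distance, and a smaller distance is taken to mean a greater similarity; all of these are assumptions of this example.

```python
import heapq

# Hypothetical pinyin lookup used only for this sketch.
PINYIN = {"苹果": "pingguo", "平果": "pingguo", "评估": "pinggu", "香蕉": "xiangjiao"}

def levenshtein(a, b):
    """Levenshtein edit distance, computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

def error_word_candidates(correct, vocabulary, top_n):
    """Return the `top_n` words whose pinyin is closest to that of `correct`."""
    others = [w for w in vocabulary if w != correct and w in PINYIN]
    return heapq.nsmallest(
        top_n, others, key=lambda w: levenshtein(PINYIN[correct], PINYIN[w]))
```

The same structure would apply to Wubi codes by swapping the lookup table; a near-word comparison table could instead be consulted directly without computing any distance.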
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: determining the similarity between the correct word and other words by using a symmetric delete spelling correction algorithm.
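As a non-limiting illustration, the idea behind symmetric delete lookup might be sketched as follows: every vocabulary word is indexed under all strings obtainable by deleting up to a fixed number of characters, and a query word is matched by generating its own deletions and looking them up. This is a simplified sketch of the general technique, not the specific implementation of this embodiment; the function names and `max_deletes` are hypothetical, and a full implementation would typically verify each hit with an edit distance check.

```python
from collections import defaultdict
from itertools import combinations

def deletes(word, max_deletes):
    """All strings obtained from `word` by deleting 0..max_deletes characters."""
    out = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for idx in combinations(range(len(word)), k):
            drop = set(idx)
            out.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return out

def build_index(vocabulary, max_deletes=1):
    """Map each deletion variant to the vocabulary words that produce it."""
    index = defaultdict(set)
    for w in vocabulary:
        for d in deletes(w, max_deletes):
            index[d].add(w)
    return index

def similar_words(word, index, max_deletes=1):
    """Vocabulary words sharing at least one deletion variant with `word`."""
    found = set()
    for d in deletes(word, max_deletes):
        found |= index.get(d, set())
    found.discard(word)
    return found
```

Because only deletions are generated on both sides, the lookup avoids enumerating insertions and substitutions over the whole alphabet, which is what makes the approach attractive for large vocabularies.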
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: in the case that the parallel corpus includes an error correction parallel corpus, generating the parallel corpus according to each correct word and the error word candidate set of each correct word includes: for each correct word, generating a plurality of candidate error correction parallel corpora according to the correct word and a plurality of error words in the error word candidate set of the correct word; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora.
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: determining a predetermined screening condition according to the word frequency of the correct word and the word frequency of the error word that forms the candidate error correction parallel corpus with the correct word, and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora according to the predetermined screening condition; for each candidate error correction parallel corpus, acquiring the context environment of the correct word and of the error word in the candidate error correction parallel corpus, judging whether the candidate error correction parallel corpus is a noise corpus according to the context environment, and acquiring the error correction parallel corpus by deleting the noise corpus from the plurality of candidate error correction parallel corpora; for each candidate error correction parallel corpus, determining, by using the Damerau-Levenshtein (D-L) edit distance algorithm, the minimum number of operations required to convert the correct word in the candidate error correction parallel corpus into the error word, judging whether the candidate error correction parallel corpus is a noise corpus according to the minimum number of operations, and acquiring the error correction parallel corpus by deleting the noise corpus from the plurality of candidate error correction parallel corpora.
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: respectively counting the word frequency of the correct word of the candidate error correction parallel corpus in the context environment and the word frequency of the error word in the context environment; and in the case that the word frequency of the correct word in the context environment is less than a preset multiple of the word frequency of the error word in the context environment, determining that the candidate error correction parallel corpus is a noise corpus.
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: determining a threshold on the number of operations for judging that the candidate error correction parallel corpus is a noise corpus; and in the case that the minimum number of operations is greater than the threshold, determining that the candidate error correction parallel corpus is a noise corpus.
Optionally, in this embodiment, the computer program stored in the memory and executed by the processor may further perform the following steps: receiving a search word input by a user; in the case that the search word is an error word in the error correction parallel corpus, acquiring the correct word corresponding to the error word, wherein the error correction parallel corpus is generated as follows: performing word segmentation on a search data set, determining correct words and the error word candidate sets of the correct words from the words obtained by the segmentation, and generating the error correction parallel corpus according to the correct words and the error word candidate sets; and searching according to the correct word and feeding back the search result to the user.
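As a non-limiting illustration, the query-time use of the error correction parallel corpus might be sketched as follows. The sketch assumes the corpus has been flattened into a mapping from each error word to its correct word, and that `search_fn` is whatever search backend is available; `correct_query`, `search`, and `search_fn` are hypothetical names.

```python
def correct_query(tokens, correction_pairs):
    """Replace each search word that is a known error word with its correct word."""
    return [correction_pairs.get(tok, tok) for tok in tokens]

def search(tokens, correction_pairs, search_fn):
    """Correct the query first, then delegate to the search backend."""
    return search_fn(correct_query(tokens, correction_pairs))
```

For example, with the mapping `{"aple": "apple"}`, the query `["aple", "phone"]` would be searched as `["apple", "phone"]`, while words absent from the mapping pass through unchanged.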
Optionally, in this embodiment, the computer program stored in the memory executed by the processor may further perform the following steps: acquiring search words input by a plurality of users within a preset time period, and generating a search word log; a search dataset is generated from the search term log.
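As a non-limiting illustration, building the search data set from a search word log might be sketched as follows. The sketch assumes each log entry is a `(timestamp, query)` pair and that the preset time period is a half-open interval from `start` (inclusive) to `end` (exclusive); the log format and the function name `build_search_dataset` are assumptions of this example.

```python
from datetime import datetime

def build_search_dataset(log_entries, start, end):
    """Keep the queries whose timestamp falls within [start, end)."""
    return [query for ts, query in log_entries if start <= ts < end]

# Usage: collect the queries entered during one day.
log = [
    (datetime(2018, 12, 4, 23, 59), "aple phone"),
    (datetime(2018, 12, 5, 9, 30), "apple phone"),
    (datetime(2018, 12, 5, 18, 0), "apple case"),
]
dataset = build_search_dataset(log, datetime(2018, 12, 5), datetime(2018, 12, 6))
```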
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of protection of the present invention.

Claims (10)

1. A parallel corpus processing method, comprising:
acquiring a search data set, performing word segmentation on the search data, and counting the word frequency of each word obtained by the segmentation;
determining a candidate set of correct words in the parallel corpus according to the counted word frequencies;
determining an error word candidate set for each correct word in the candidate set of correct words;
generating the parallel corpus according to each correct word and the error word candidate set of each correct word; wherein the error correction parallel corpus is screened from a plurality of candidate error correction parallel corpora as follows:
for each candidate error correction parallel corpus, acquiring the context environment of the correct word and of the error word in the candidate error correction parallel corpus; judging whether the candidate error correction parallel corpus is a noise corpus according to the context environment; and obtaining the error correction parallel corpus by deleting the noise corpus from the plurality of candidate error correction parallel corpora, wherein judging whether the candidate error correction parallel corpus is a noise corpus according to the context environment comprises: respectively counting the word frequency of the correct word of the candidate error correction parallel corpus in the context environment and the word frequency of the error word in the context environment; and in the case that the word frequency of the correct word in the context environment is less than a preset multiple of the word frequency of the error word in the context environment, determining that the candidate error correction parallel corpus is a noise corpus.
2. The method of claim 1, wherein determining a candidate set of correct words in the parallel corpus based on the counted word frequencies comprises:
determining high-frequency words with word frequency exceeding a preset word frequency threshold according to the counted word frequency;
and generating a candidate set of correct words in the parallel corpus according to the high-frequency words.
3. The method of claim 1, wherein determining an error word candidate set for each correct word in the candidate set of correct words comprises:
for each correct word in the candidate set of correct words, selecting a preset number of error words having the greatest similarity to the correct word;
and generating the error word candidate set of the correct word according to the selected error words.
4. The method of claim 3, further comprising, prior to selecting the preset number of error words having the greatest similarity to the correct word:
determining the similarity between the correct word and other words in at least one of the following ways: determining the similarity according to the edit distance between the pinyin of the correct word and the pinyin of the other words; determining the similarity according to the edit distance between the Wubi (five-stroke) codes of the correct word and those of the other words; and determining the similarity between the correct word and the other words according to a preset near-word comparison table.
5. The method of claim 4, wherein the similarity between the correct word and the other words is determined using a symmetric delete spelling correction algorithm.
6. A parallel corpus processing method, comprising:
receiving a search word input by a user;
in the case that the search word is an error word in an error correction parallel corpus, acquiring the correct word corresponding to the error word, wherein the error correction parallel corpus is generated as follows: performing word segmentation on a search data set, determining correct words and the error word candidate sets of the correct words from the words obtained by the segmentation, and generating the error correction parallel corpus according to the correct words and the error word candidate sets; wherein generating the error correction parallel corpus according to the correct words and the error word candidate sets includes: for each correct word, generating a plurality of candidate error correction parallel corpora according to the correct word and a plurality of error words in the error word candidate set of the correct word; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora; wherein the error correction parallel corpus is screened from the plurality of candidate error correction parallel corpora as follows: for each candidate error correction parallel corpus, acquiring the context environment of the correct word and of the error word in the candidate error correction parallel corpus; determining whether the candidate error correction parallel corpus is a noise corpus according to the context environment, which comprises: respectively counting the word frequency of the correct word of the candidate error correction parallel corpus in the context environment and the word frequency of the error word in the context environment, and determining that the candidate error correction parallel corpus is a noise corpus in the case that the word frequency of the correct word in the context environment is less than a preset multiple of the word frequency of the error word in the context environment; and acquiring the error correction parallel corpus by deleting the noise corpus from the plurality of candidate error correction parallel corpora;
searching according to the correct word, and feeding back a search result to the user.
7. The method of claim 6, further comprising, prior to receiving the user-entered search term:
acquiring search words input by a plurality of users within a preset time period, and generating a search word log;
and generating the search data set according to the search word log.
8. A parallel corpus processing apparatus, comprising:
the acquisition module is used for acquiring a search data set, performing word segmentation on the search data, and counting the word frequency of each word obtained by the segmentation;
the first determining module is used for determining a candidate set of correct words in the parallel corpus according to the counted word frequency;
a second determining module, configured to determine a candidate set of incorrect words of each correct word in the candidate set of correct words;
the generation module is used for generating the parallel corpus according to each correct word and the error word candidate set of each correct word;
the generation module is further configured such that, in the case that the parallel corpus includes an error correction parallel corpus, generating the parallel corpus according to each correct word and the error word candidate set of each correct word includes: for each correct word, generating a plurality of candidate error correction parallel corpora according to the correct word and a plurality of error words in the error word candidate set of the correct word; and screening the error correction parallel corpus from the plurality of candidate error correction parallel corpora; wherein the error correction parallel corpus is screened from the plurality of candidate error correction parallel corpora as follows: for each candidate error correction parallel corpus, acquiring the context environment of the correct word and of the error word in the candidate error correction parallel corpus; determining whether the candidate error correction parallel corpus is a noise corpus according to the context environment, which comprises: respectively counting the word frequency of the correct word of the candidate error correction parallel corpus in the context environment and the word frequency of the error word in the context environment, and determining that the candidate error correction parallel corpus is a noise corpus in the case that the word frequency of the correct word in the context environment is less than a preset multiple of the word frequency of the error word in the context environment; and acquiring the error correction parallel corpus by deleting the noise corpus from the plurality of candidate error correction parallel corpora.
9. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the parallel corpus processing method of any one of claims 1 to 7.
10. A computer device, comprising: a memory and a processor, wherein
the memory stores a computer program;
the processor is configured to execute a computer program stored in the memory, and when the computer program is executed, the parallel corpus processing method according to any one of claims 1 to 7 is executed.
CN201811481225.8A 2018-12-05 2018-12-05 Parallel corpus processing method and device, storage medium and computer equipment Active CN111353025B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811481225.8A CN111353025B (en) 2018-12-05 2018-12-05 Parallel corpus processing method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811481225.8A CN111353025B (en) 2018-12-05 2018-12-05 Parallel corpus processing method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN111353025A CN111353025A (en) 2020-06-30
CN111353025B true CN111353025B (en) 2024-02-27

Family

ID=71195270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811481225.8A Active CN111353025B (en) 2018-12-05 2018-12-05 Parallel corpus processing method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111353025B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560452B (en) * 2021-02-25 2021-05-18 智者四海(北京)技术有限公司 Method and system for automatically generating error correction corpus
CN113204966B (en) * 2021-06-08 2023-03-28 重庆度小满优扬科技有限公司 Corpus augmentation method, apparatus, device and storage medium
CN113822044B (en) * 2021-09-29 2023-03-21 深圳市木愚科技有限公司 Grammar error correction data generating method, device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11328317A (en) * 1998-05-11 1999-11-30 Nippon Telegr & Teleph Corp <Ntt> Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
CN102915314A (en) * 2011-08-05 2013-02-06 腾讯科技(深圳)有限公司 Automatic error correction pair generation method and system
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine
CN106919681A (en) * 2017-02-28 2017-07-04 东软集团股份有限公司 The error correction method and device of wrong word
CN106959977A (en) * 2016-01-12 2017-07-18 广州市动景计算机科技有限公司 Candidate collection computational methods and device, word error correction method and device in word input
CN107977357A (en) * 2017-11-22 2018-05-01 北京百度网讯科技有限公司 Error correction method, device and its equipment based on user feedback
CN108717412A (en) * 2018-06-12 2018-10-30 北京览群智数据科技有限责任公司 Chinese check and correction error correction method based on Chinese word segmentation and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
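Shen Jian (沈健). Search engine query error correction system based on a statistical model (基于统计模型的搜索引擎查询纠错系统). China Master's Theses Full-text Database. 2018, full text. *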

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant