CN117033575A

CN117033575A - Mixed word detection method, device, electronic equipment and readable storage medium

Info

Publication number: CN117033575A
Application number: CN202310828980.3A
Authority: CN
Inventors: 陈劲
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-11-10

Abstract

The application discloses a mixed word detection method, a mixed word detection device, electronic equipment and a readable storage medium, and belongs to the technical field of data processing. Wherein the method comprises the following steps: acquiring a word to be detected, a prefix word list library and a suffix word list library; respectively carrying out similarity calculation on the words to be detected, the prefix words in the prefix vocabulary library and the suffix words in the suffix vocabulary library; and determining whether the word to be detected is a mixed word according to the similarity calculation result. The scheme provided by the application can solve the problem of low success rate of mixed word detection and matching.

Description

Mixed word detection method, device, electronic equipment and readable storage medium

Technical Field

The application belongs to the technical field of data processing, and particularly relates to a mixed word detection method, a mixed word detection device, electronic equipment and a readable storage medium.

Background

With the development and popularity of internet social media, new mixed words formed by recombining two existing words often appear. At present, English mixed words are mainly identified by means of an established, continuously updated mixed word text library: when a mixed word is encountered, query matching is performed in the mixed word text library. Because the matching depends on the content of the mixed word text library, the accuracy of mixed word detection and matching is low when a mixed word that is not present in the library is encountered.

Disclosure of Invention

The embodiments of the application aim to provide a mixed word detection method, a mixed word detection device, electronic equipment and a readable storage medium, which can solve the problem in the related art that the success rate of mixed word detection and matching is low.

In a first aspect, an embodiment of the present application provides a mixed word detection method, where the method includes:

acquiring a word to be detected, a prefix word list library and a suffix word list library;

respectively carrying out similarity calculation on the words to be detected, the prefix words in the prefix vocabulary library and the suffix words in the suffix vocabulary library;

and determining whether the word to be detected is a mixed word according to the similarity calculation result.

Optionally, the obtaining the prefix vocabulary library and the suffix vocabulary library includes:

obtaining a mixed word library, wherein the mixed word library comprises K mixed words, and K is an integer greater than 1;

splitting the K mixed words into K first prefix words and K first suffix words based on a dictionary tree data structure;

constructing the prefix word list library based on the K first prefix words;

and constructing the suffix word list library based on the K first suffix words.

Optionally, after similarity calculation is performed on the to-be-detected word and the prefix word in the prefix vocabulary library and the suffix word in the suffix vocabulary library, the method further includes:

obtaining M target prefix words similar to the word to be detected in the prefix word list library, and obtaining N target suffix words similar to the word to be detected in the suffix word list library, wherein M and N are each an integer greater than or equal to 1.

Optionally, the obtaining M target prefix words in the prefix vocabulary similar to the word to be detected, and obtaining N target suffix words in the suffix vocabulary similar to the word to be detected, includes:

according to a similarity matching algorithm, performing similarity comparison on the to-be-detected words and the prefix words in the prefix word list library, and obtaining a first similarity result corresponding to each prefix word in the prefix word list library;

according to a similarity matching algorithm, performing similarity comparison on the to-be-detected words and the suffix words in the suffix word list library, and obtaining second similarity results corresponding to each suffix word in the suffix word list library;

determining a first prefix word sequence based on the first similarity result corresponding to each prefix word in the prefix word list library, wherein the first prefix word sequence comprises the M target prefix words which are arranged from high to low according to the first similarity result;

and determining a second suffix word sequence based on the second similarity result corresponding to each suffix word in the suffix word list library, wherein the second suffix word sequence comprises the N target suffix words which are arranged in a sequence from high to low according to the second similarity result.

Optionally, the determining, according to the similarity calculation result, whether the word to be detected is a mixed word includes:

acquiring the M target prefix words and the N target suffix words based on the similarity calculation result;

splicing the M target prefix words and the N target suffix words to obtain S target mixed words, wherein S is the product of M and N;

acquiring average editing distances between the S target mixed words and the words to be detected;

and judging whether the word to be detected is a mixed word or not based on the average editing distance.

Optionally, the obtaining the average edit distance between the S target mixed words and the word to be detected includes:

acquiring the character number of the word to be detected;

acquiring S first editing distances between the S target mixed words and the words to be detected respectively;

based on each first editing distance and the number of characters, S second editing distances are obtained;

and taking the average value of the S second editing distances as the average editing distance.

Optionally, the determining whether the word is a mixed word based on the average editing distance includes:

judging whether the average editing distance is smaller than or equal to a preset threshold value;

and under the condition that the average editing distance is smaller than or equal to the preset threshold value, determining that the word to be detected is a mixed word.

In a second aspect, an embodiment of the present application provides a mixed word detection apparatus, including:

the acquisition module is used for acquiring words to be detected, a prefix word list library and a suffix word list library;

the processing module is used for carrying out similarity calculation on the words to be detected, the prefix words in the prefix word list library and the suffix words in the suffix word list library respectively;

and the judging module is used for determining whether the word to be detected is a mixed word according to the similarity calculation result.

In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor, a memory, and a program or an instruction stored on the memory and executable on the processor, where the program or the instruction is executed by the processor to implement the steps of the mixed word detection method according to the first aspect.

In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the mixed word detection method as described in the first aspect.

In the embodiments of the application, based on a prefix word list library and a suffix word list library acquired in advance, similarity calculation is performed between the word to be detected and the prefix words in the prefix word list library, and between the word to be detected and the suffix words in the suffix word list library. A plurality of prefix words and suffix words with higher similarity to the word to be detected are determined from the similarity calculation result. These prefix words and suffix words are then spliced to determine whether a word similar to the word to be detected exists, and thereby whether the word to be detected is a mixed word. Therefore, a large number of mixed words do not need to be stored in a text library, the requirement on storage space is reduced, mixed words that are absent from the text library can still be recognized, and the accuracy of recognition and matching of the word to be detected is improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a schematic flow chart of a mixed word detection method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of the process of obtaining the prefix vocabulary library and the suffix vocabulary library of FIG. 1;

FIG. 3 is a schematic flow chart of determining whether the word to be detected is a mixed word in FIG. 1;

FIG. 4 is a cumulative distribution diagram of the S first edit distances between the S target mixed words and the word to be detected;

FIG. 5 is a cumulative distribution diagram of the S second edit distances between the S target mixed words and the word to be detected;

FIG. 6 is a schematic structural diagram of a mixed word detecting device according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are obtained by a person skilled in the art based on the embodiments of the present application, fall within the scope of protection of the present application.

The terms first, second and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type, and are not limited to the number of objects, such as the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.

The method, the device, the electronic equipment and the readable storage medium for detecting mixed words provided by the embodiment of the application are described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 is a flowchart of a mixed word detection method according to an embodiment of the application.

As shown in fig. 1, the mixed word detection method includes the following steps:

step 101, obtaining a word to be detected, a prefix word list library and a suffix word list library.

It should be noted that the word to be detected in the present application may be an English mixed word to be detected, for example the English word "Brunch", which is composed of the words "Breakfast" and "Lunch". The word to be detected may also be a Chinese mixed word or a Chinese-English mixed word; the application does not particularly limit the word to be detected.

Specifically, the prefix vocabulary library and the suffix vocabulary library may be obtained by splitting the mixed words in an existing mixed vocabulary library into prefix words and suffix words, and placing the prefix words in the prefix vocabulary library and the suffix words in the suffix vocabulary library. Alternatively, existing roots may be divided so that prefix roots form the prefix word list library and suffix roots form the suffix word list library. Dividing the content into a prefix word list library and a suffix word list library reduces the storage of repeated content and improves storage efficiency.

step 102, performing similarity calculation between the word to be detected and the prefix words in the prefix word list library, and between the word to be detected and the suffix words in the suffix word list library.

It can be understood that similarity calculation is performed between the word to be detected and the prefix words in the prefix vocabulary library, and between the word to be detected and the suffix words in the suffix vocabulary library. The similarity calculation can determine the prefix words in the prefix word list library and the suffix words in the suffix word list library that have higher similarity to the word to be detected. The prefix words can be ranked according to their similarity to the word to be detected, and a plurality of prefix words with higher similarity are determined based on the ranking. Similarly, a plurality of suffix words with higher similarity may be determined. The application does not limit the number or order of the selected prefix words and suffix words.

Further, the similarity calculation may use a string similarity comparison algorithm (e.g., Jaro-Winkler similarity), or another algorithm capable of determining the similarity between the word to be detected and other words.
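As an illustration of the string similarity mentioned above, the following is a minimal sketch of the standard Jaro-Winkler similarity. The function names `jaro` and `jaro_winkler`, the prefix scale `p = 0.1` and the four-character prefix cap are conventional textbook choices assumed here; the application does not specify an implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters count as a match only inside this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost rewards candidates that share their first characters with the word to be detected, which is one reason this measure is a plausible fit for matching prefix words.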

Step 103, determining whether the word to be detected is a mixed word according to the similarity calculation result.

In the embodiment of the application, whether the word to be detected is a mixed word is determined based on the similarity calculation result. The similarity calculation result may be a plurality of prefix words and suffix words that are similar to the word to be detected, as determined by the similarity algorithm. The determined prefix words and suffix words are spliced to generate a plurality of complete mixed words, and the edit distance between each generated mixed word and the word to be detected is determined. All obtained edit distances are normalized to determine an average edit distance, which can be used as an index for judging whether the word to be detected is a mixed word. Whether the word to be detected is a mixed word is determined by comparing the average edit distance with a preset threshold value. In this way, normalization reduces the influence of the differing string lengths of the spliced words, words to be detected that are non-mixed words are effectively filtered out, and the success rate of mixed word detection is improved.

Optionally, as shown in fig. 2, the obtaining a prefix vocabulary library and a suffix vocabulary library includes:

step 201, obtaining a mixed word stock, wherein the mixed word stock comprises K mixed words, and K is an integer greater than 1;

step 202, splitting the K mixed words into K first prefix words and K first suffix words based on a dictionary tree data structure;

step 203, constructing the prefix word list library based on the K first prefix words;

step 204, constructing the suffix word list library based on the K first suffix words.

The prefix word list library and the suffix word list library are obtained by splitting the mixed words in the mixed word library: the prefix words and suffix words of the existing mixed words are obtained, the prefix words are merged into the prefix word list library, and the suffix words are merged into the suffix word list library. Specifically, the existing mixed words may be split using a dictionary tree (Trie) data structure; other manners may also be used, which is not limited herein. Because each library keeps only mutually different prefix words or suffix words, repeated content is reduced, storage space is saved, and storage efficiency is improved.
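The application does not describe how the dictionary tree decides where to cut a mixed word, so the following is only one possible sketch. It assumes a hypothetical list of source words is loaded into a trie, and that each mixed word is cut at the longest leading fragment that also begins some source word. The names `Trie`, `split_blend` and `build_libraries` are illustrative, and mixed words are assumed to have at least two characters.

```python
class Trie:
    """Minimal character trie; each node is a dict of child nodes."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})

    def longest_prefix_len(self, text):
        """Length of the longest prefix of `text` found along a trie path."""
        node, length = self.root, 0
        for ch in text:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        return length


def split_blend(blend, trie):
    """Cut a mixed word into (prefix word, suffix word).

    Assumption: the cut falls where the blend stops following any
    source word; at least one character is kept on each side.
    """
    cut = trie.longest_prefix_len(blend)
    cut = min(max(cut, 1), len(blend) - 1)
    return blend[:cut], blend[cut:]


def build_libraries(blends, source_words):
    """Build de-duplicated prefix and suffix word list libraries."""
    trie = Trie(source_words)
    prefixes, suffixes = set(), set()
    for blend in blends:
        prefix, suffix = split_blend(blend, trie)
        prefixes.add(prefix)
        suffixes.add(suffix)
    return sorted(prefixes), sorted(suffixes)


prefix_lib, suffix_lib = build_libraries(
    ["brunch", "smog", "motel"], ["breakfast", "smoke", "motor"]
)
print(prefix_lib)  # ['br', 'mot', 'smo']
print(suffix_lib)  # ['el', 'g', 'unch']
```

Storing the fragments in sets is what removes repeated prefix and suffix words, which matches the storage saving described above.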

In specific implementation, based on the result of the similarity calculation, prefix words in the prefix word list library that are similar to the word to be detected are determined to be target prefix words, and suffix words in the suffix word list library that are similar to the word to be detected are determined to be target suffix words. The number of target prefix words and the number of target suffix words are not particularly limited. The two numbers may be the same, or may be determined according to the prefix and suffix characteristics of the word to be detected; for example, when the prefix of the word to be detected is shorter and simpler and the suffix is more complex, the number of target prefix words may be set smaller than the number of target suffix words. By determining the target prefix words and target suffix words that are respectively similar to the prefix and the suffix of the word to be detected, the recognition and detection process is simplified, storage space is saved, and detection efficiency is improved.

It can be appreciated that the target prefix words in the prefix word list library and the target suffix words in the suffix word list library with higher similarity to the word to be detected can be determined through a character string similarity matching algorithm (e.g., Jaro-Winkler similarity). The prefix words are arranged from high to low according to their first similarity results, and the suffix words are arranged from high to low according to their second similarity results. The target prefix words can be the M prefix words with the highest similarity in the arranged sequence, and the target suffix words can be the N suffix words with the highest similarity in the arranged sequence. The target prefix words may be consecutive prefix words in the ranked sequence, and the target suffix words may be consecutive suffix words in the ranked sequence. Determining the target prefix words and target suffix words with the highest similarity to the word to be detected, and identifying the word to be detected on that basis, can improve both detection efficiency and detection accuracy.
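The ranking and selection described above can be sketched as follows. To keep the example short and self-contained, the similarity score is `difflib.SequenceMatcher.ratio()` from the Python standard library, used here only as a stand-in for Jaro-Winkler; the helper names `similarity` and `top_k` and the sample libraries are assumptions.

```python
import heapq
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1] (not Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def top_k(word, library, k, score=similarity):
    """Return the k library entries most similar to `word`, best first."""
    return heapq.nlargest(k, library, key=lambda cand: score(word, cand))


# Hypothetical libraries; M = 1 target prefix word, N = 2 target suffix words.
prefix_library = ["br", "sm", "mo"]
suffix_library = ["og", "unch", "tel"]

target_prefixes = top_k("brunch", prefix_library, 1)
target_suffixes = top_k("brunch", suffix_library, 2)
print(target_prefixes)     # ['br']
print(target_suffixes[0])  # unch
```

Passing `score` as a parameter lets the stand-in be swapped for any other similarity function without changing the selection logic.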

Optionally, as shown in fig. 3, the determining, according to the similarity calculation result, whether the word to be detected is a mixed word includes:

step 301, obtaining the M target prefix words and the N target suffix words based on the similarity calculation result;

step 302, splicing the M target prefix words and the N target suffix words to obtain S target mixed words, wherein S is the product of M and N;

step 303, obtaining average editing distances between the S target mixed words and the words to be detected;

and step 304, judging whether the word to be detected is a mixed word or not based on the average editing distance.

In the embodiment of the application, after the target prefix words and the target suffix words are determined, each target prefix word can be spliced with each target suffix word; specifically, Cartesian product splicing can be adopted to obtain M × N target mixed words. Each target mixed word is compared with the word to be detected to determine the edit distance between them, and the average edit distance is then determined from these edit distances and used to determine whether the word to be detected is a mixed word. Because the word to be detected is compared with target mixed words spliced from the closest target prefix words and target suffix words, and the judgment is based on its similarity to those target mixed words, the accuracy of detection can be effectively improved, the dependence on existing word samples in a mixed word stock is reduced, and the requirement on storage space is reduced.
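A minimal sketch of the Cartesian product splicing and the edit distance comparison follows. The edit distance is assumed to be the Levenshtein distance (insertions, deletions and substitutions, each of cost 1), which the application does not state explicitly; the function names and sample fragments are illustrative.

```python
from itertools import product


def splice(target_prefixes, target_suffixes):
    """Cartesian product splicing: M prefixes x N suffixes -> M*N words."""
    return [p + s for p, s in product(target_prefixes, target_suffixes)]


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


candidates = splice(["br", "sm"], ["unch", "og"])
print(candidates)  # ['brunch', 'brog', 'smunch', 'smog']

# First edit distances between each target mixed word and the word to detect.
first_distances = [levenshtein(c, "brunch") for c in candidates]
print(first_distances)  # [0, 4, 2, 6]
```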

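Optionally, the obtaining the average editing distance between the S target mixed words and the word to be detected includes:

acquiring the character number of the word to be detected;

acquiring S first editing distances between the S target mixed words and the word to be detected respectively;

obtaining S second editing distances based on each first editing distance and the number of characters;

and taking the average value of the S second editing distances as the average editing distance.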

Specifically, the average edit distance between the target mixed words and the word to be detected can be determined through normalization processing. The normalization may divide the first edit distance between each target mixed word and the word to be detected by the number of characters of the target mixed word to obtain a second edit distance. The second edit distance removes the influence of differing character counts on the edit distance, so that the word to be detected can be identified accurately. Subsequently, the average of the second edit distances is determined as the average edit distance. For example, referring to fig. 4 and fig. 5, the first edit distance and the second edit distance can be summarized by plotting a cumulative distribution function (CDF) on the ordinate against the graph edit distance (Graph Edit Distance, GED) on the abscissa. Fig. 4 is a cumulative distribution diagram of the S first edit distances between the S target mixed words and the word to be detected, and fig. 5 is a cumulative distribution diagram of the S second edit distances between the S target mixed words and the word to be detected. The data in fig. 5 remove the influence of the differing string lengths of the target mixed words and are more accurate than the data in fig. 4, so the average edit distance determined for the word to be detected is more accurate.
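The normalization step can be sketched as below. The paragraph above divides by the character count of each target mixed word, while the optional steps listed earlier acquire the character count of the word to be detected; the sketch therefore takes the character counts as an argument so that either reading can be supplied. The function name and the sample values are assumptions.

```python
def average_normalized_distance(first_distances, char_counts):
    """Average of the second edit distances.

    Each second edit distance is a first edit distance divided by the
    matching character count supplied by the caller.
    """
    if not first_distances or len(first_distances) != len(char_counts):
        raise ValueError("need one positive character count per distance")
    second_distances = [d / n for d, n in zip(first_distances, char_counts)]
    return sum(second_distances) / len(second_distances)


# Hypothetical first edit distances for four target mixed words.
first = [0, 4, 2, 6]

# Reading 1: divide by the length of each target mixed word.
print(average_normalized_distance(first, [6, 4, 6, 4]))

# Reading 2: divide by the length of the word to be detected (6 characters).
print(average_normalized_distance(first, [6] * 4))
```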

In one embodiment of the present application, whether the word to be detected is a mixed word is determined by comparing the average edit distance with a preset threshold value. When the average edit distance is less than or equal to the preset threshold value, the word to be detected is highly likely to be a mixed word and is determined to be a mixed word. The preset threshold may be obtained by training a mixed word detection model with mixed words as training samples and, according to the training result, determining the cumulative distribution relationship between all trained mixed words and their corresponding source words. For example, in fig. 5 the preset threshold may be determined to be 0.6; with a preset threshold of 0.6, the accuracy of recognizing words to be detected may reach more than 93%, and non-mixed words can be effectively filtered out.
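The threshold comparison, and one hypothetical way of reading a threshold off an empirical cumulative distribution, might look like the sketch below. The `coverage` parameter (the fraction of known mixed words that should fall at or below the threshold) is an assumption introduced for illustration; it is not the same quantity as the accuracy figure quoted above, and the application does not give the training procedure.

```python
import math


def is_mixed_word(average_distance: float, threshold: float = 0.6) -> bool:
    """Decision rule from the text: at or below the threshold -> mixed word."""
    return average_distance <= threshold


def threshold_from_cdf(known_distances, coverage: float) -> float:
    """Smallest observed distance whose empirical CDF reaches `coverage`.

    `known_distances` are average edit distances measured on known mixed
    words (hypothetical training data).
    """
    if not known_distances:
        raise ValueError("known_distances must not be empty")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(known_distances)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[index]


print(is_mixed_word(0.5))   # True
print(is_mixed_word(0.75))  # False
```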

In another embodiment of the present application, the mixed word detection method may be applied to text detection, for detecting whether text content contains mixed words. In particular, the text content to be identified may be split into single words. Similarity calculation is then performed between each word in the text and the prefix words in the prefix word list library, and between each word and the suffix words in the suffix word list library. Based on the similarity calculation result, the prefix words and suffix words corresponding to each word are determined and spliced to obtain a plurality of mixed words. An edit distance algorithm is then used to calculate the edit distances between each word in the text under detection and its corresponding mixed words, and the mixed words possibly present in the text are determined from these edit distances. Therefore, the efficiency and recall rate of searching for mixed words in the text are effectively improved, storage space can be effectively saved, the mixed words can be updated adaptively, and the success rate of detecting and identifying mixed words is improved.
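For the text detection scenario, the word splitting and per-word check can be sketched as follows. The per-word detector is passed in as a callable so that any implementation of the method can be plugged in; the regular expression, which only handles English letters, the lower-casing and the name `find_mixed_words` are assumptions for illustration.

```python
import re
from typing import Callable, List

# Assumption: words are runs of ASCII letters; Chinese or mixed-script
# text would need a different segmentation step.
WORD_RE = re.compile(r"[A-Za-z]+")


def find_mixed_words(text: str,
                     is_mixed_word: Callable[[str], bool]) -> List[str]:
    """Split text into single words and keep those the detector accepts.

    Each distinct word is checked once; results keep first-seen order.
    """
    found, seen = [], set()
    for token in WORD_RE.findall(text):
        word = token.lower()
        if word in seen:
            continue
        seen.add(word)
        if is_mixed_word(word):
            found.append(word)
    return found


# Toy detector standing in for the similarity and edit distance pipeline.
toy_detector = lambda w: w in {"brunch", "motel"}
print(find_mixed_words("We had Brunch, then brunch again near the motel.",
                       toy_detector))
# ['brunch', 'motel']
```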

Referring to fig. 6, an embodiment of the present application provides a mixed word detecting apparatus 400, including:

an obtaining module 401, configured to obtain a word to be detected, a prefix vocabulary library, and a suffix vocabulary library;

the processing module 402 is configured to perform similarity calculation on the word to be detected and a prefix word in the prefix vocabulary library and a suffix word in the suffix vocabulary library respectively;

and the judging module 403 is configured to determine whether the word to be detected is a mixed word according to the similarity calculation result.

Optionally, the obtaining module 401 includes:

the mixed word stock module is used for obtaining a mixed word stock, wherein the mixed word stock comprises K mixed words, and K is an integer greater than 1;

the splitting module is used for splitting the K mixed words into K first prefix words and K first suffix words based on the dictionary tree data structure;

the first construction module is used for constructing the prefix word list library based on the K first prefix words;

and the second construction module is used for constructing the suffix word list library based on the K first suffix words.

Optionally, the processing module 402 is further configured to:

Optionally, the determining module 403 includes:

the similarity result module is used for acquiring the M target prefix words and the N target suffix words based on the similarity calculation result;

the mixed word splicing module is used for splicing the M target prefix words and the N target suffix words to obtain S target mixed words, wherein S is the product of M and N;

the average editing distance module is used for acquiring the average editing distance between the S target mixed words and the words to be detected;

and the mixed word judging module is used for judging whether the word to be detected is a mixed word or not based on the average editing distance.

Optionally, the average edit distance module is configured to:

acquiring the character number of the word to be detected;

Optionally, the mixed word judging module is configured to:

The mixed word detection device 400 provided in the embodiment of the present application can implement each process implemented by the embodiment of the method described in fig. 1, and can achieve the same beneficial effects, so that repetition is avoided, and no further description is provided here.

Referring to fig. 7, fig. 7 is a block diagram of an electronic device according to an embodiment of the present application, as shown in fig. 7, including: a processor 501, a memory 502 and a program or instruction stored on the memory 502 and executable on the processor 501, the processor 501 being configured to read the program or instruction in the memory 502; the electronic device further comprises a bus interface and a transceiver 503.

In fig. 7, a bus architecture may comprise any number of interconnected buses and bridges, which link together various circuits including one or more processors represented by processor 501 and memory represented by memory 502. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, power management circuits, etc., which are well known in the art and, therefore, will not be described further herein. The bus interface provides an interface. The transceiver 503 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 501 is responsible for managing the bus architecture and general processing, and the memory 502 may store data used by the processor 501 in performing operations.

The processor 501 is configured to read a program or an instruction in the memory 502, and perform the following steps:

Optionally, the processor 501 is configured to read a program or an instruction in the memory 502, and perform the following steps:

constructing the prefix word list library based on the K first prefix words;

acquiring the character number of the word to be detected;

The electronic device provided by the embodiment of the application can realize each process realized by the embodiment of the method shown in fig. 1, and can achieve the same beneficial effects, and in order to avoid repetition, the description is omitted here.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored, and when the program or the instruction is executed by a processor, the processes of the embodiment of the mixed word detection method described in fig. 1 are implemented, and the same technical effects can be achieved, so that repetition is avoided, and no further description is given here.

Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk or an optical disk, and the like.

The embodiment of the application further provides a chip, which comprises a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is used for running a program or instructions to realize the processes of the embodiment of the mixed word detection method described in fig. 1, and the same technical effects can be achieved, so that repetition is avoided, and the description is omitted here.

It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-level chip, a chip system, or a system-on-chip, etc.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a computer software product that is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disc) and comprises instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods according to the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the specific embodiments described, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may devise many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims

1. A mixed word detection method, the method comprising:

2. The method of claim 1, wherein the obtaining a prefix vocabulary library and a suffix vocabulary library comprises:

constructing the prefix vocabulary library based on the K first prefix words;

3. The method of claim 1, wherein after performing the similarity calculation between the word to be detected and, respectively, the prefix words in the prefix vocabulary library and the suffix words in the suffix vocabulary library, the method further comprises:

4. The method of claim 3, wherein the obtaining M target prefix words in the prefix vocabulary library that are similar to the word to be detected and obtaining N target suffix words in the suffix vocabulary library that are similar to the word to be detected comprises:

5. The method of claim 3 or 4, wherein the determining whether the word to be detected is a mixed word according to the similarity calculation result comprises:

6. The method of claim 5, wherein the obtaining the average edit distance of the S target mixed words and the word to be detected comprises:

acquiring the character number of the word to be detected;

7. The method of claim 5, wherein the determining whether the word to be detected is a mixed word based on the average edit distance comprises:

8. A mixed word detection device, comprising:

9. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the mixed word detection method of any one of claims 1-7.

10. A readable storage medium, wherein a program or instructions are stored on the readable storage medium, and the program or instructions, when executed by a processor, implement the steps of the mixed word detection method according to any one of claims 1-7.
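
Illustrative note (not part of the claims). Claims 5 to 7 refer to an average edit distance between S target mixed words and the word to be detected, and to the character number of the word to be detected, but the individual steps are not reproduced in the claim text above. The following is only a minimal sketch of one way such a check could look. It assumes that the edit distance is the Levenshtein distance, that the average is divided by the character number of the word to be detected, and that the decision uses a fixed threshold; the function names and the threshold value 0.3 are hypothetical and do not come from the application.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed metric)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def average_edit_distance(word: str, target_mixed_words: Sequence[str]) -> float:
    """Mean edit distance between the word and the S target mixed words."""
    if not target_mixed_words:
        raise ValueError("at least one target mixed word is required")
    total = sum(edit_distance(word, t) for t in target_mixed_words)
    return total / len(target_mixed_words)


def is_mixed_word(word: str, target_mixed_words: Sequence[str],
                  threshold: float = 0.3) -> bool:
    """Hypothetical decision rule: the average edit distance, divided by the
    character number of the word, must not exceed the threshold."""
    if not word:
        raise ValueError("word to be detected must be non-empty")
    return average_edit_distance(word, target_mixed_words) / len(word) <= threshold
```

In this sketch, `target_mixed_words` stands for the S candidate mixed words that would be formed from the M target prefix words and the N target suffix words of claim 3; how those candidates are built is outside the scope of the example.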