WO2022156180A1

WO2022156180A1 - Similar text determination method and related device

Info

Publication number: WO2022156180A1
Application number: PCT/CN2021/109391
Authority: WO
Inventors: 李小娟
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2021-01-19
Filing date: 2021-07-29
Publication date: 2022-07-28
Also published as: CN112395886A; CN112395886B

Abstract

The present application relates to artificial intelligence, and provides a similar text determination method and a related device. The method comprises: determining a text to be detected and a target text; generating a feature vector to be detected and a target feature vector; calculating the similarity between the feature vector to be detected and the target feature vector; determining a similarity coefficient and polarity features; generating text features according to the text similarity, the similarity coefficient, and the polarity features; converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector; generating semantic features of the text to be detected and the target text; and determining a similarity result according to the text features and the semantic features. The present application can improve the accuracy of determining a similar text. In addition, the present application further relates to blockchain technology, and the similar result may be stored in a blockchain.

Description

Similar text determination method and related equipment

This application claims priority to the Chinese patent application No. 202110071000.0, filed on January 19, 2021 and titled "Method for Determining Similar Texts and Related Equipment", the entire contents of which are incorporated into this application by reference.

Technical field

The present application relates to the technical field of artificial intelligence, and in particular, to a method for determining similar texts and related devices.

Background

At present, traditional unsupervised text similarity algorithms determine the similarity of sentences from the co-occurrence information of the texts. However, if the texts contain synonyms or near-synonyms, the similarity between the two texts cannot be calculated accurately, which reduces the accuracy of determining similar texts. To overcome this shortcoming, deep text similarity algorithms were developed, in which sentences are mapped to semantic representations by an encoding layer. However, the inventor realized that when two texts contain similar wording but have opposite meanings, the accuracy of determining similar texts is still low.

SUMMARY OF THE INVENTION

In view of the above content, it is necessary to provide a similar text determination method and related equipment, which can improve the determination accuracy of similar texts.

A first aspect of the present application provides a method for determining similar texts, the method for determining similar texts includes:

receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;

obtain the target text from the similar text determination request;

Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations;

Obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;

Generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;

Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the multiple to-be-detected word segmentations and the multiple target word segmentations as co-occurrence words;

Calculate the number of co-occurrences of the co-occurrence words, and calculate the total amount of the word segmentation of all the word segmentation;

Divide the number of co-occurrences by the total amount of word segmentation to obtain a similarity coefficient;

Determine the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;

Generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;

Converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;

Generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determine a similarity result between the text to be detected and the target text according to the text features and the semantic features.

A second aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:

obtain the target text from the similar text determination request;

A third aspect of the present application provides a computer-readable storage medium on which at least one computer-readable instruction is stored, and the at least one computer-readable instruction is executed by a processor to implement the following steps:

obtain the target text from the similar text determination request;

A fourth aspect of the present application provides an apparatus for determining similar texts, and the apparatus for determining similar texts includes:

a determination unit, configured to receive a similar text determination request, and determine the text to be detected according to the similar text determination request;

an obtaining unit for obtaining the target text from the similar text determination request;

a generating unit, configured to perform word segmentation processing on the text to be detected to obtain a plurality of word segmentations to be detected, and perform word segmentation processing on the target text to obtain a plurality of target word segmentations;

The generating unit is also used to obtain the union of the plurality of word segmentations to be detected and the plurality of target word segmentations to obtain all word segmentations;

The generating unit is further configured to generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and to generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;

The determining unit is further configured to calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and to determine the intersection of the plurality of to-be-detected word segmentations and the plurality of target word segmentations as co-occurrence words;

The determining unit is also used to calculate the co-occurrence quantity of the co-occurrence words, and calculate the total amount of the word segmentation of all the word segmentations;

The determining unit is further configured to divide the co-occurrence number by the total amount of word segmentation to obtain a similarity coefficient;

The determining unit is further configured to determine the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;

The generating unit is further configured to generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;

a conversion unit, for converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;

The determining unit is further configured to generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and to determine a similarity result between the text to be detected and the target text according to the text feature and the semantic feature.

It can be seen from the above technical solutions that the present application determines the text similarity, the similarity coefficient and the polarity feature of the text to be detected and the target text. Because the polarity feature characterizes whether the tone of the text to be detected is the same as the tone of the target text, the degree of similarity between the two texts can be determined accurately even when they contain similar wording but have opposite meanings, which solves the problem of low accuracy. The similarity result between the text to be detected and the target text can then be accurately determined from the text feature and the semantic feature.

Description of drawings

FIG. 1 is a flowchart of a preferred embodiment of the method for determining similar texts in the present application.

FIG. 2 is a flowchart of an embodiment of the present application for generating a feature vector to be detected.

FIG. 3 is a flow chart of an embodiment of the present application for generating semantic features.

FIG. 4 is a functional block diagram of a preferred embodiment of the apparatus for determining similar texts of the present application.

FIG. 5 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the method for determining similar texts in the present application.

Detailed description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in detail below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a flowchart of a preferred embodiment of the method for determining similar texts of the present application. According to different requirements, the order of the steps in this flowchart can be changed, and some steps can be omitted.

The similar text determination method is applied to one or more electronic devices. The electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored computer-readable instructions, and its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

The electronic device can be any electronic product that can interact with the user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an interactive network television (Internet Protocol Television, IPTV), smart wearable devices, etc.

The electronic equipment may include network equipment and/or user equipment. Wherein, the network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing (Cloud Computing).

The network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.

S10: Receive a similar text determination request, and determine the text to be detected according to the similar text determination request.

In at least one embodiment of the present application, the information carried in the similar text determination request includes, but is not limited to: target text, storage location, and the like. The similar text determination request can be triggered by any user.

The text to be detected refers to the text that needs to be detected whether it is similar to the target text. There may be multiple texts to be detected.

In at least one embodiment of the present application, the electronic device determining the text to be detected according to the similar text determination request includes:

Parsing the message of the similar text determination request, and obtaining the data information carried by the message;

Obtain information for indicating a location from the data information as a storage location;

A to-be-detected text library is determined from the storage location, and any text is extracted from the to-be-detected text library as the to-be-detected text.

Through the above implementation manner, since it is not necessary to parse the entire similar text determination request, the efficiency of obtaining the storage location can be improved, and the text to be detected can be quickly obtained.

S11, obtain the target text from the similar text determination request.

In at least one embodiment of the present application, the target text refers to the reference text in the similar text determination request.

In at least one embodiment of the present application, the obtaining, by the electronic device, the target text from the similar text determination request includes:

Information for indicating text is acquired from the data information as the target text.

Through the above implementation manner, since the target text is stored in the similar text determination request, the target text can be quickly acquired from the data information obtained by parsing.
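Steps S10 and S11 can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify the message format, so a JSON message is assumed, and the field names `target_text` and `storage_location`, the helper names, and the in-memory `library` dictionary standing in for the to-be-detected text library are all hypothetical.

```python
import json


def parse_request(raw_message: str):
    """Parse the request message and return (target_text, storage_location).

    The JSON layout and both field names are assumptions for illustration;
    the source only states that the request carries a target text and a
    storage location.
    """
    data = json.loads(raw_message)  # data information carried by the message
    return data["target_text"], data["storage_location"]


def pick_text_to_detect(text_library: dict, storage_location: str, index: int = 0) -> str:
    """Extract one text from the to-be-detected text library at the location."""
    return text_library[storage_location][index]


# Hypothetical request and text library.
request = json.dumps({
    "target_text": "I cannot help you apply",
    "storage_location": "library/a",
})
library = {"library/a": ["I will help you apply now"]}

target_text, location = parse_request(request)
text_to_detect = pick_text_to_detect(library, location)
```

Because the library may hold several texts, the same call can be repeated with a different `index` to compare each of them against the target text.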

S12, generating a feature vector to be detected according to the text to be detected and the target text, and generating a target feature vector according to the text to be detected and the target text.

Referring to FIG. 2, FIG. 2 is a flowchart of an embodiment of the present application for generating a feature vector to be detected. In at least one embodiment of the present application, generating, by the electronic device, a feature vector to be detected according to the text to be detected and the target text includes:

S120: Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations.

The multiple word segments to be detected may be multiple words, and the multiple target word segments may be multiple words.

S121: Acquire the union of the multiple to-be-detected word segments and the multiple target word segments to obtain all the segmented words.

S122: Generate the feature vector to be detected according to the mapping relationship between the multiple word segments to be detected and all the word segments.

The mapping relationship refers to whether the multiple to-be-detected word segments exist in all the word segments.

For example, the multiple word segments to be detected are: I, Immediately, Immediately, Help, You, Apply, Please, Okay, Do, and the multiple target word segments are: I, No, Do, Fa, Help, You, Apply, Please. All the word segments are therefore: I, Help, Apply, Please, Immediately, Immediately, You, Okay, Do, followed by the word segments that appear only in the target text. The first nine of all the word segments appear among the multiple word segments to be detected and the remaining four do not; therefore, the feature vector to be detected is [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].

Through the above implementation, the feature vector to be detected can be determined according to the text to be detected and the target text. Since the feature vector to be detected is generated with reference to the target text, it can be accurately determined.

In at least one embodiment of the present application, the electronic device generating a target feature vector according to the text to be detected and the target text includes:

The target feature vector is generated according to the mapping relationship between the target word segment and all word segments.
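The mapping described in S120 to S122 can be sketched as a binary presence vector over the union of both token lists. The sketch below makes two assumptions that are not in the source: word segmentation is replaced by whitespace splitting of English text, and the union is ordered by first appearance (the source example lists the shared word segments first). The function names are hypothetical.

```python
def build_vocabulary(detect_tokens, target_tokens):
    """Union of both token lists in first-seen order, duplicates dropped."""
    vocab = []
    for token in list(detect_tokens) + list(target_tokens):
        if token not in vocab:
            vocab.append(token)
    return vocab


def binary_feature_vector(tokens, vocab):
    """1 where the vocabulary entry occurs in `tokens`, otherwise 0."""
    present = set(tokens)
    return [1 if token in present else 0 for token in vocab]


# Whitespace splitting stands in for a real word segmenter (assumption).
detect_tokens = "i will help you apply now".split()
target_tokens = "i cannot help you apply".split()

vocab = build_vocabulary(detect_tokens, target_tokens)
detect_vec = binary_feature_vector(detect_tokens, vocab)
target_vec = binary_feature_vector(target_tokens, vocab)
```

Both vectors share the same dimension (the size of the union), which is what allows them to be compared component by component in the next step.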

S13: Calculate the similarity between the feature vector to be detected and the target feature vector, obtain the text similarity between the text to be detected and the target text, and determine a similarity coefficient according to the text to be detected and the target text.

In at least one embodiment of the present application, the electronic device uses a cosine similarity calculation formula to calculate the similarity between the feature vector to be detected and the target feature vector.

The specific cosine similarity calculation formula is as follows:

$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

where cosθ is the similarity between the feature vector to be detected and the target feature vector, n is the vector dimension of the feature vector to be detected and the target feature vector, i is the index of the current vector dimension, x_i is the i-th component of the feature vector to be detected, and y_i is the i-th component of the target feature vector.

The text similarity can be quickly determined through the cosine similarity calculation formula.
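A minimal sketch of the cosine similarity computation is given below. The vector `x` is the feature vector to be detected from the example above; the vector `y` is an assumption, since the source does not list the target feature vector: it is chosen so that four positions are shared and four positions occur only in the target text, which matches the counts given in the example. The function name is hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_x * norm_y)


x = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # from the example in the text
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # assumed target feature vector
text_similarity = round(cosine_similarity(x, y), 4)  # 4 / (3 * sqrt(8)) -> 0.4714
```

Under this assumed target vector the result agrees with the text similarity of 0.4714 used later in the description.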

In at least one embodiment of the present application, the electronic device determining the similarity coefficient according to the text to be detected and the target text includes:

Determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;

The similarity coefficient is obtained by dividing the co-occurrence number by the total number of word segmentations.

Following the above example, the co-occurrence words are I, Help, Apply, and Please, so the number of co-occurrences is 4, and the total number of word segments among all the word segments is 13. The similarity coefficient is therefore 4/13 ≈ 0.3077.

Through the above-mentioned embodiments, the similarity coefficient can be accurately determined according to the co-occurrence words of the text to be detected and the target text.
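The similarity coefficient is the size of the intersection divided by the size of the union, a ratio commonly known as the Jaccard index. The sketch below uses placeholder tokens (`c*` shared, `d*` only in the text to be detected, `t*` only in the target text) rather than the original word segments; the placeholders and the function name are hypothetical and only reproduce the counts of the example (4 shared out of 13 in total).

```python
def similarity_coefficient(detect_tokens, target_tokens):
    """Number of co-occurrence words divided by the number of all word segments."""
    detect_set, target_set = set(detect_tokens), set(target_tokens)
    all_segments = detect_set | target_set   # union
    co_occurring = detect_set & target_set   # intersection
    if not all_segments:
        return 0.0  # convention chosen here for two empty inputs
    return len(co_occurring) / len(all_segments)


# Placeholder tokens reproducing the counts of the example (assumption).
detect_tokens = ["c1", "c2", "c3", "c4", "d1", "d2", "d3", "d4", "d5"]
target_tokens = ["c1", "c2", "c3", "c4", "t1", "t2", "t3", "t4"]
coefficient = round(similarity_coefficient(detect_tokens, target_tokens), 4)
```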

S14. Determine the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text.

In at least one embodiment of the present application, the polarity feature is 1 or 0. When the tone of the text to be detected is the same as the tone of the target text, the polarity feature is determined to be 1; when the tone of the text to be detected is different from the tone of the target text, the polarity feature is determined to be 0.

In at least one embodiment of the present application, the electronic device determining the polarity characteristics of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text includes:

Detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word is used to indicate a negative tone;

Determine the first tone of the text to be detected according to the first detection result, and determine the second tone of the target text according to the second detection result;

If the first tone is the same as the second tone, determining the polarity feature as a first value; or

If the first tone is different from the second tone, the polarity feature is determined as a second value.

Wherein, the preset words include, but are not limited to: none, none, no.

Through the above-mentioned embodiments, the tone of the text to be detected and the target text can be accurately determined according to the preset words, and then the polarity feature can be accurately determined.
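The polarity check can be sketched as a lookup against a list of negation words. The English word list below is a hypothetical stand-in for the preset words of the source, the function names are illustrative, and the values 1 and 0 follow the embodiment in which matching tones give 1.

```python
# Hypothetical English stand-ins for the preset negation words.
NEGATION_WORDS = frozenset({"no", "not", "cannot", "never"})


def has_negative_tone(tokens):
    """Detection result: True if any preset negation word occurs in the text."""
    return any(token in NEGATION_WORDS for token in tokens)


def polarity_feature(detect_tokens, target_tokens):
    """1 when both texts have the same tone, 0 when the tones differ."""
    first_tone = has_negative_tone(detect_tokens)
    second_tone = has_negative_tone(target_tokens)
    return 1 if first_tone == second_tone else 0


polarity = polarity_feature(
    "i will help you apply now".split(),
    "i cannot help you apply".split(),
)
```

Here the first text is affirmative and the second is negative, so the polarity feature is 0 even though most of the words are shared.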

S15. Generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature.

In at least one embodiment of the present application, the text feature is obtained by splicing the text similarity, the similarity coefficient and the polarity feature.

For example, the text similarity is 0.4714, the similarity coefficient is 0.3077, and the polarity feature is 0. After splicing, the text feature obtained is [0.4714, 0.3077, 0].

S16: Convert the text to be detected into a semantic vector to be detected, and convert the target text into a target semantic vector.

In at least one embodiment of the present application, the semantic vector to be detected includes the semantics of the text to be detected, and the target semantic vector includes the semantics of the target text.

In at least one embodiment of the present application, the electronic device converting the text to be detected into a semantic vector to be detected includes:

converting the text to be detected into a sequence of word vectors;

Use forward long short-term memory network to perform feature extraction on the word vector sequence to obtain a first feature vector;

Use the reverse long short-term memory network to perform feature extraction on the word vector sequence to obtain a second feature vector;

Splicing the first feature vector and the second feature vector to obtain the to-be-detected semantic vector.

Through the above-mentioned embodiments, the generated semantic vector to be detected can have the contextual semantics of the text to be detected, and the accuracy of determination of the semantic vector to be detected can be improved.
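The forward and reverse passes can be illustrated with a small pure-Python LSTM. This is a sketch under several assumptions that the source does not state: the weights are random and untrained, the final hidden state of each direction is taken as that direction's feature vector, and the dimensions (4-dimensional word vectors, hidden size 3) and all function names are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_lstm(input_size, hidden_size, rng):
    """Random weights and zero biases for the input, forget, candidate and output gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in ("i", "f", "g", "o")}


def lstm_last_hidden(sequence, params, hidden_size):
    """Run one LSTM over the sequence and return its final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        z = list(x) + h  # current input concatenated with the previous hidden state

        def gate(name, activation):
            weights, biases = params[name]
            return [activation(sum(w * v for w, v in zip(row, z)) + b)
                    for row, b in zip(weights, biases)]

        i, f = gate("i", sigmoid), gate("f", sigmoid)
        g, o = gate("g", math.tanh), gate("o", sigmoid)
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def bilstm_semantic_vector(word_vectors, forward, backward, hidden_size):
    """Splice the forward and reverse feature vectors into one semantic vector."""
    first = lstm_last_hidden(word_vectors, forward, hidden_size)
    second = lstm_last_hidden(list(reversed(word_vectors)), backward, hidden_size)
    return first + second


rng = random.Random(0)
word_vectors = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
forward, backward = init_lstm(4, 3, rng), init_lstm(4, 3, rng)
semantic_vector = bilstm_semantic_vector(word_vectors, forward, backward, 3)
```

The spliced vector has twice the hidden size, since it holds one feature vector per reading direction. In practice the weights would be learned and a deep learning library would normally be used instead of hand-written loops.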

S17, generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result between the text to be detected and the target text according to the text features and the semantic features.

It should be emphasized that, in order to further ensure the privacy and security of the above similar results, the above similar results can also be stored in a node of a blockchain.

In at least one embodiment of the present application, the similarity result is either that the text to be detected is similar to the target text or that the text to be detected is not similar to the target text.

Referring to FIG. 3 , FIG. 3 is a flowchart of an embodiment of generating semantic features of the present application. In at least one embodiment of the present application, the electronic device generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector includes:

S170, subtract the target semantic vector from the to-be-detected semantic vector to obtain a difference vector.

S171, splicing the to-be-detected semantic vector, the target semantic vector, and the difference vector to obtain a spliced semantic vector.

S172: Perform iterative mapping on the spliced semantic vector by using a pre-built multi-layer hidden layer to obtain the semantic feature.

Through the above-mentioned embodiment, since the semantic feature is obtained by operating on the to-be-detected semantic vector and the target semantic vector, the semantic feature can capture the semantics of both the text to be detected and the target text, which improves the accuracy of the semantic feature.
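Steps S170 to S172 can be sketched as follows. The hidden layers here are assumptions: two fully connected layers with tanh activations and random, untrained weights stand in for the pre-built multi-layer hidden layer, whose actual structure the source does not describe. The function names and layer sizes are hypothetical.

```python
import math
import random


def splice_with_difference(detect_vec, target_vec):
    """S170 and S171: difference vector, then detect + target + difference."""
    difference = [a - b for a, b in zip(detect_vec, target_vec)]
    return list(detect_vec) + list(target_vec) + difference


def dense_layer(vector, weights, biases):
    """One hidden layer: affine map followed by tanh (activation is an assumption)."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, biases)]


def semantic_feature(detect_vec, target_vec, layers):
    """S172: map the spliced vector through each hidden layer in turn."""
    output = splice_with_difference(detect_vec, target_vec)
    for weights, biases in layers:
        output = dense_layer(output, weights, biases)
    return output


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
layers = [random_layer(6, 4, rng), random_layer(4, 2, rng)]  # untrained stand-ins
spliced = splice_with_difference([1.0, 2.0], [0.5, 0.5])
feature = semantic_feature([1.0, 2.0], [0.5, 0.5], layers)
```

Keeping both original vectors alongside their difference lets the later layers see what each text says as well as how the two differ.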

In at least one embodiment of the present application, determining, by the electronic device, according to the text feature and the semantic feature, the similarity result between the text to be detected and the target text includes:

Splicing the text feature and the semantic feature to obtain a target vector;

Input the target vector into a pre-built binary classification network to obtain the similar result.

Through the above implementation manner, since the similar results are determined by using the text features and the semantic features, the similar results can be accurately determined.
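The final decision can be sketched with a single logistic unit standing in for the pre-built binary classification network. Everything numeric below except the text feature is an assumption: the semantic feature values, the weights, the bias and the 0.5 threshold are hypothetical and untrained, and the function name is illustrative.

```python
import math


def classify(text_feature, semantic_feature, weights, bias, threshold=0.5):
    """Splice both features into a target vector and apply a logistic unit."""
    target_vector = list(text_feature) + list(semantic_feature)
    if len(weights) != len(target_vector):
        raise ValueError("one weight per target-vector component is required")
    score = sum(w * v for w, v in zip(weights, target_vector)) + bias
    probability = 1.0 / (1.0 + math.exp(-score))
    label = "similar" if probability >= threshold else "not similar"
    return label, probability


text_feature = [0.4714, 0.3077, 0]        # from the example in the text
semantic_feature = [0.2, -0.1]            # hypothetical values
weights = [1.0, 1.0, 2.0, 0.5, 0.5]       # hypothetical, untrained
label, probability = classify(text_feature, semantic_feature, weights, bias=-1.5)
```

With these illustrative weights the polarity component carries the largest weight, so a polarity feature of 0 pulls the score down and the pair is classified as not similar.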

FIG. 4 is a functional block diagram of a preferred embodiment of the apparatus for determining similar texts of the present application. The similar text determination device 11 includes a determination unit 110, an acquisition unit 111, a generation unit 112 and a conversion unit 113. The module/unit referred to in this application refers to a series of computer-readable instruction segments that can be acquired by the processor 13, can perform fixed functions, and are stored in the memory 12. In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.

The determination unit 110 receives the similar text determination request, and determines the text to be detected according to the similar text determination request.

In at least one embodiment of the present application, the determining unit 110 determines the text to be detected according to the similar text determination request includes:

The obtaining unit 111 obtains the target text from the similar text determination request.

In at least one embodiment of the present application, the acquiring unit 111 acquiring the target text from the similar text determination request includes:

The generating unit 112 generates a feature vector to be detected according to the text to be detected and the target text, and generates a target feature vector according to the text to be detected and the target text.

In at least one embodiment of the present application, the generating unit 112 generates a feature vector to be detected according to the text to be detected and the target text, including:

Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations.

Obtain the union of the multiple to-be-detected word segments and the multiple target word segments to obtain all the segmented words.

The to-be-detected feature vector is generated according to the mapping relationship between the plurality of to-be-detected word segments and all of the word segments.

In at least one embodiment of the present application, the generating unit 112 generates a target feature vector according to the text to be detected and the target text, including:

The determining unit 110 calculates the similarity between the feature vector to be detected and the target feature vector, obtains the text similarity between the text to be detected and the target text, and determines the similarity coefficient according to the text to be detected and the target text.

In at least one embodiment of the present application, the determining unit 110 uses a cosine similarity calculation formula to calculate the similarity between the feature vector to be detected and the target feature vector.

The specific cosine similarity calculation formula is as follows:

In at least one embodiment of the present application, the determining unit 110 determining the similarity coefficient according to the text to be detected and the target text includes:

The determining unit 110 determines the polarity characteristics of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text.

In at least one embodiment of the present application, the determining unit 110 determines the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text, including:

Wherein, the preset words include, but are not limited to: none, none, no.

The generating unit 112 generates text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature.

The converting unit 113 converts the text to be detected into a semantic vector to be detected, and converts the target text into a target semantic vector.

In at least one embodiment of the present application, the converting unit 113 converts the text to be detected into a semantic vector to be detected including:

converting the text to be detected into a sequence of word vectors;

The determining unit 110 generates the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determines the similarity result between the text to be detected and the target text according to the text features and the semantic features.

In at least one embodiment of the present application, the determining unit 110 generates the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, including:

Subtract the target semantic vector from the to-be-detected semantic vector to obtain a difference vector.

Splicing the to-be-detected semantic vector, the target semantic vector and the difference vector to obtain a spliced semantic vector.

The spliced semantic vector is iteratively mapped by using a pre-built multi-layer hidden layer to obtain the semantic feature.

In at least one embodiment of the present application, the determining unit 110 determines the similarity result between the text to be detected and the target text according to the text feature and the semantic feature, including:

Splicing the text feature and the semantic feature to obtain a target vector;

As shown in FIG. 5 , it is a schematic structural diagram of an electronic device implementing a preferred embodiment of the method for determining similar texts of the present application.

In one embodiment of the present application, the electronic device 1 includes, but is not limited to, a memory 12 , a processor 13 , and computer-readable instructions stored in the memory 12 and executable on the processor 13 , such as similar text determination programs.

Those skilled in the art can understand that the schematic diagram is only an example of the electronic device 1, and does not constitute a limitation on the electronic device 1, and may include more or less components than the one shown, or combine some components, or different Components, for example, the electronic device 1 may also include input and output devices, network access devices, buses, and the like.

The processor 13 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc. The processor 13 is the computing core and control center of the electronic device 1, and uses various interfaces and lines to connect the entire electronic device. 1, and the operating system that executes the electronic device 1, as well as various installed applications, program codes, and the like.

Exemplarily, the computer-readable instructions may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to Complete this application. The one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the computer-readable instruction segments are used to describe the execution process of the computer-readable instructions in the electronic device 1 . For example, the computer readable instructions may be divided into a determining unit 110 , an obtaining unit 111 , a generating unit 112 and a converting unit 113 .

The memory 12 can be used to store the computer-readable instructions and/or modules. The processor 13 realizes the various functions of the electronic device 1 by running or executing the computer-readable instructions and/or modules stored in the memory 12 and invoking the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, and the like. The memory 12 may include non-volatile and volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one disk storage device, a flash memory device, or another storage device.

The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick, a TF card (Trans-flash Card), and the like.

If the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium. Based on this understanding, the present application may implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through computer-readable instructions, which may be stored in a computer-readable storage medium. The computer-readable instructions, when executed by the processor, can implement the steps of the above-mentioned method embodiments.

The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, and the like. The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), or a random access memory (RAM).

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
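
As an illustration of data blocks linked by cryptographic methods, the following minimal Python sketch chains blocks by storing the SHA-256 hash of the previous block in each new block. It is only a toy model under stated assumptions: the names `block_hash`, `append_block` and `is_valid` and the record fields `pair_id` and `similar` are hypothetical, and no consensus mechanism or network layer is modeled.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's contents, including the hash of its predecessor."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, records: list) -> dict:
    """Create a block holding `records` and link it to the previous block."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    block = {"index": len(chain), "records": records, "prev_hash": prev_hash}
    chain.append(block)
    return block


def is_valid(chain: list) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


# Hypothetical similarity results stored as block records.
chain = []
append_block(chain, [{"pair_id": "demo-1", "similar": True}])
append_block(chain, [{"pair_id": "demo-2", "similar": False}])
```

Because each block records the hash of the block before it, altering a stored record in an earlier block makes `is_valid` fail for the chain, which is the anti-counterfeiting property described above.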

With reference to FIG. 1, the memory 12 in the electronic device 1 stores computer-readable instructions to implement a method for determining similar text, and the processor 13 can execute the computer-readable instructions to implement:

obtain the target text from the similar text determination request;

Generate a feature vector to be detected according to the text to be detected and the target text, and generate a target feature vector according to the text to be detected and the target text;

Calculate the similarity between the feature vector to be detected and the target feature vector, obtain the text similarity between the text to be detected and the target text, and determine a similarity coefficient according to the text to be detected and the target text;

Specifically, for the way in which the processor 13 implements the computer-readable instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
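
The feature-vector, text-similarity and similarity-coefficient steps listed above can be sketched in Python as follows. This is a hedged illustration, not the applicant's implementation: it assumes the texts have already been segmented into words, that each feature vector marks the presence of each word from the union of all word segments, and that the similarity measure is cosine similarity (the text only says "similarity"). The coefficient follows the claims: the number of co-occurring words divided by the total number of word segments. All function names are hypothetical.

```python
import math


def presence_vector(tokens, vocabulary):
    """Mark with 1 each vocabulary entry that occurs in `tokens`, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def text_similarity_and_coefficient(detect_tokens, target_tokens):
    """Return (text similarity, similarity coefficient) for two token lists."""
    vocabulary = sorted(set(detect_tokens) | set(target_tokens))  # union: all word segments
    co_occurring = set(detect_tokens) & set(target_tokens)        # intersection: co-occurring words
    detect_vec = presence_vector(detect_tokens, vocabulary)
    target_vec = presence_vector(target_tokens, vocabulary)
    similarity = cosine_similarity(detect_vec, target_vec)
    coefficient = len(co_occurring) / len(vocabulary) if vocabulary else 0.0
    return similarity, coefficient


# Whitespace splitting stands in for a real word segmenter here.
similarity, coefficient = text_similarity_and_coefficient(
    "I like apples".split(), "I like pears".split()
)
```

In this example the union holds four words and two of them co-occur, so the coefficient is 2/4 = 0.5, while the cosine similarity of the two presence vectors is 2/3.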

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. For example, the division of the modules is only a logical function division, and there may be other division manners in actual implementation.

The computer-readable storage medium stores computer-readable instructions, wherein the computer-readable instructions are used to implement the following steps when executed by the processor 13:

obtain the target text from the similar text determination request;

Calculate the similarity of the feature vector to be detected and the target feature vector, obtain the text similarity of the text to be detected and the target text, and determine a similarity coefficient according to the text to be detected and the target text;

The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

Accordingly, the embodiments are to be regarded in all respects as illustrative and not restrictive, and the scope of the application is defined by the appended claims rather than by the foregoing description; all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be included in this application. Any reference signs in the claims shall not be construed as limiting the claims involved.

Furthermore, it is clear that the word "comprising" does not exclude other units or steps and the singular does not exclude the plural. The multiple units or devices described may also be implemented by one unit or device through software or hardware. The words first, second, etc. are used to denote names and do not denote any particular order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of those technical solutions.

Claims

A similar text determination method, wherein the similar text determination method comprises:

receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;

obtain the target text from the similar text determination request;

Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations;

Obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;

Generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;

Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the plurality of to-be-detected word segmentations and the plurality of target word segmentations as co-occurrence words;

Calculate the number of co-occurrences of the co-occurrence words, and calculate the total amount of word segmentation across all the word segmentations;

Divide the number of co-occurrences by the total amount of word segmentation to obtain a similarity coefficient;

Determine the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;

Generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;

Converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;

Generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determine a similarity result for the text to be detected and the target text according to the text features and the semantic features.
The method for determining similar texts according to claim 1, wherein the determining the text to be detected according to the similar text determination request comprises:

Parsing a message of the similar text determination request, and obtaining data information carried by the message;

Obtain information for indicating a location from the data information as a storage location;

A to-be-detected text library is determined from the storage location, and any text is extracted from the to-be-detected text library as the to-be-detected text.
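
For illustration only, the steps of this claim might be sketched in Python as follows. The claim does not specify a message format or a storage backend, so everything concrete here is an assumption: the message is taken to be JSON, the key `storage_location` and the in-memory `TEXT_LIBRARIES` mapping are hypothetical stand-ins, and "any text" is read as a random choice.

```python
import json
import random

# Hypothetical stand-in for storage locations and the text libraries found there.
TEXT_LIBRARIES = {
    "library/demo": ["How do I reset my password?", "Where is my order?"],
}


def text_to_detect_from_request(raw_message: str, rng=random) -> str:
    """Parse the request, read its storage location, and pick any text there."""
    data = json.loads(raw_message)        # data information carried by the message
    location = data["storage_location"]   # information indicating a location (assumed key)
    library = TEXT_LIBRARIES[location]    # to-be-detected text library at that location
    return rng.choice(library)            # any text from the library


message = json.dumps({"storage_location": "library/demo"})
text_to_detect = text_to_detect_from_request(message)
```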
The method for determining similar texts according to claim 1, wherein the generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments comprises:

The to-be-detected feature vector is generated according to the mapping relationship between the plurality of to-be-detected word segments and all of the word segments.
The method for determining similar texts according to claim 1, wherein the determining the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text comprises:

Detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word is used to indicate a negative tone;

Determine the first tone of the text to be detected according to the first detection result, and determine the second tone of the target text according to the second detection result;

If the first tone is the same as the second tone, determining the polarity feature as a first value; or

If the first tone is different from the second tone, the polarity feature is determined as a second value.
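
A minimal Python sketch of this polarity determination is given below. The preset word list and the concrete first and second values (1 and 0) are assumptions chosen for illustration; the claim fixes neither, and the function names are hypothetical.

```python
# Assumed preset words indicating a negative tone; the claim does not list them.
NEGATION_WORDS = ("not", "no", "never")


def has_negative_tone(tokens):
    """Detection result: True if any preset negative word occurs in the tokens."""
    return any(token.lower() in NEGATION_WORDS for token in tokens)


def polarity_feature(detect_tokens, target_tokens, first_value=1, second_value=0):
    """Return first_value when both texts share a tone, otherwise second_value."""
    first_tone = has_negative_tone(detect_tokens)
    second_tone = has_negative_tone(target_tokens)
    return first_value if first_tone == second_tone else second_value


same = polarity_feature("I do not like apples".split(), "I never eat apples".split())
different = polarity_feature("I like apples".split(), "I do not like apples".split())
```

Two texts that share most words but differ in tone, such as the second pair, would score highly on word overlap alone; the polarity feature gives the later stages a signal to separate them.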
The method for determining similar texts according to claim 1, wherein the converting the text to be detected into a semantic vector to be detected comprises:

converting the text to be detected into a sequence of word vectors;

Use forward long short-term memory network to perform feature extraction on the word vector sequence to obtain a first feature vector;

Use the reverse long short-term memory network to perform feature extraction on the word vector sequence to obtain a second feature vector;

Splicing the first feature vector and the second feature vector to obtain the to-be-detected semantic vector.
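
The forward pass, reverse pass and splice of this claim can be sketched in plain Python as below. This is an untrained toy under stated assumptions: weights are randomly initialised, the "feature vector" of each direction is taken to be the final hidden state, and the class and function names are hypothetical. A practical system would use a trained network from a deep-learning library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A single LSTM cell with randomly initialised (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.weights = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                   for _ in range(hidden_size)]
            for gate in "ifoc"
        }
        self.bias = {gate: [0.0] * hidden_size for gate in "ifoc"}

    def _gate(self, gate, xh, activation):
        return [
            activation(sum(w * v for w, v in zip(row, xh)) + b)
            for row, b in zip(self.weights[gate], self.bias[gate])
        ]

    def run(self, sequence):
        """Process the sequence in the given order; return the last hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            xh = list(x) + h
            i = self._gate("i", xh, sigmoid)
            f = self._gate("f", xh, sigmoid)
            o = self._gate("o", xh, sigmoid)
            g = self._gate("c", xh, math.tanh)
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h


def bilstm_semantic_vector(word_vectors, forward_cell, backward_cell):
    first = forward_cell.run(word_vectors)                    # forward LSTM
    second = backward_cell.run(list(reversed(word_vectors)))  # reverse LSTM
    return first + second                                     # splice (concatenate)


rng = random.Random(0)
forward_cell = LSTMCell(input_size=3, hidden_size=4, rng=rng)
backward_cell = LSTMCell(input_size=3, hidden_size=4, rng=rng)
word_vectors = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
semantic_vector = bilstm_semantic_vector(word_vectors, forward_cell, backward_cell)
```

With a hidden size of 4 per direction, the spliced semantic vector has 8 components: the first 4 from the forward pass and the last 4 from the reverse pass.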
The method for determining similar texts according to claim 1, wherein the generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector comprises:

Subtract the target semantic vector from the to-be-detected semantic vector to obtain a difference vector;

Splicing the semantic vector to be detected, the target semantic vector and the difference vector to obtain a splicing semantic vector;

The spliced semantic vector is iteratively mapped by using a pre-built multi-layer hidden layer to obtain the semantic feature.
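
The difference, splice and iterative mapping of this claim can be sketched in Python as follows. The layer weights, the `tanh` activation and the layer sizes are illustrative assumptions (the claim only requires pre-built hidden layers), and the function names are hypothetical.

```python
import math


def splice_with_difference(detect_vec, target_vec):
    """Concatenate both semantic vectors with their element-wise difference."""
    difference = [d - t for d, t in zip(detect_vec, target_vec)]  # detect minus target
    return list(detect_vec) + list(target_vec) + difference


def dense_layer(vector, weights, bias):
    """One hidden layer: affine map followed by tanh (assumed activation)."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def semantic_feature(detect_vec, target_vec, layers):
    """Map the spliced vector through each hidden layer in turn."""
    hidden = splice_with_difference(detect_vec, target_vec)
    for weights, bias in layers:
        hidden = dense_layer(hidden, weights, bias)
    return hidden


# Toy pre-built layers: 6 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.1] * 6, [-0.1] * 6], [0.0, 0.0]),
    ([[0.5, -0.5]], [0.0]),
]
spliced = splice_with_difference([1.0, 0.0], [0.5, 0.5])
feature = semantic_feature([1.0, 0.0], [0.5, 0.5], layers)
```

Here `spliced` is `[1.0, 0.0, 0.5, 0.5, 0.5, -0.5]`: the two semantic vectors followed by their difference, so the hidden layers see both texts and how they differ.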
A similar text determination device, wherein the similar text determination device comprises:

a determination unit, configured to receive a similar text determination request, and determine the text to be detected according to the similar text determination request;

an obtaining unit for obtaining the target text from the similar text determination request;

a generating unit, configured to perform word segmentation processing on the text to be detected to obtain a plurality of word segmentations to be detected, and perform word segmentation processing on the target text to obtain a plurality of target word segmentations;

The generating unit is also used to obtain the union of the plurality of word segmentations to be detected and the plurality of target word segmentations to obtain all word segmentations;

The generating unit is further configured to generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and to generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;

The determining unit is further configured to calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and to determine the intersection of the plurality of to-be-detected word segmentations and the plurality of target word segmentations as co-occurrence words;

The determining unit is further configured to calculate the number of co-occurrences of the co-occurrence words, and to calculate the total amount of word segmentation across all the word segmentations;

The determining unit is further configured to divide the co-occurrence number by the total amount of word segmentation to obtain a similarity coefficient;

The determining unit is further configured to determine the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;

The generating unit is further configured to generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;

a conversion unit, for converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;

The determining unit is further configured to generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and to determine a similarity result for the text to be detected and the target text according to the text features and the semantic features.
The similar text determination device according to claim 7, wherein the conversion unit converting the text to be detected into a semantic vector to be detected comprises:

converting the text to be detected into a sequence of word vectors;

Use forward long short-term memory network to perform feature extraction on the word vector sequence to obtain a first feature vector;

Use the reverse long short-term memory network to perform feature extraction on the word vector sequence to obtain a second feature vector;

Splicing the first feature vector and the second feature vector to obtain the to-be-detected semantic vector.
An electronic device, wherein the electronic device includes a processor and a memory, and the processor is configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:

receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;

obtain the target text from the similar text determination request;

Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations;

Obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;

Generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;

Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the plurality of to-be-detected word segmentations and the plurality of target word segmentations as co-occurrence words;

Calculate the number of co-occurrences of the co-occurrence words, and calculate the total amount of word segmentation across all the word segmentations;

Divide the number of co-occurrences by the total amount of word segmentation to obtain a similarity coefficient;

Determine the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;

Generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;

Converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;

Generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determine a similarity result for the text to be detected and the target text according to the text features and the semantic features.
The electronic device according to claim 9, wherein, when the text to be detected is determined according to the similar text determination request, the processor executes the at least one computer-readable instruction to implement the following steps:

Parsing a message of the similar text determination request, and obtaining data information carried by the message;

Obtain information for indicating a location from the data information as a storage location;

A to-be-detected text library is determined from the storage location, and any text is extracted from the to-be-detected text library as the to-be-detected text.
The electronic device according to claim 9, wherein, when the to-be-detected feature vector is generated according to the plurality of to-be-detected word segments and the plurality of target word segments, the processor executes the at least one computer-readable instruction to implement the following steps:

The to-be-detected feature vector is generated according to the mapping relationship between the plurality of to-be-detected word segments and all of the word segments.
The electronic device according to claim 9, wherein, when determining the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text, the processor executes the at least one computer-readable instruction to implement the following steps:

Detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word is used to indicate a negative tone;

Determine the first tone of the text to be detected according to the first detection result, and determine the second tone of the target text according to the second detection result;

If the first tone is the same as the second tone, determining the polarity feature as a first value; or

If the first tone is different from the second tone, the polarity feature is determined as a second value.
The electronic device according to claim 9, wherein, when converting the text to be detected into a semantic vector to be detected, the processor executes the at least one computer-readable instruction to implement the following steps:

converting the text to be detected into a sequence of word vectors;

Use forward long short-term memory network to perform feature extraction on the word vector sequence to obtain a first feature vector;

Use the reverse long short-term memory network to perform feature extraction on the word vector sequence to obtain a second feature vector;

Splicing the first feature vector and the second feature vector to obtain the to-be-detected semantic vector.
The electronic device according to claim 9, wherein when the semantic features of the text to be detected and the target text are generated according to the semantic vector to be detected and the target semantic vector, the processor executes the at least one computer-readable instruction to implement the following steps:

Subtract the target semantic vector from the to-be-detected semantic vector to obtain a difference vector;

Splicing the semantic vector to be detected, the target semantic vector and the difference vector to obtain a splicing semantic vector;

The spliced semantic vector is iteratively mapped by using a pre-built multi-layer hidden layer to obtain the semantic feature.
A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and the at least one computer-readable instruction implements the following steps when executed by a processor:

receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;

obtain the target text from the similar text determination request;

Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations;

Obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;

Generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;

Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the plurality of to-be-detected word segmentations and the plurality of target word segmentations as co-occurrence words;

Calculate the number of co-occurrences of the co-occurrence words, and calculate the total amount of word segmentation across all the word segmentations;

Divide the number of co-occurrences by the total amount of word segmentation to obtain a similarity coefficient;

Determine the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;

Generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;

Converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;

Generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determine a similarity result for the text to be detected and the target text according to the text features and the semantic features.
16. The storage medium of claim 15, wherein when the text to be detected is determined according to the similar text determination request, the at least one computer-readable instruction is executed by a processor to implement the following steps:

Parsing a message of the similar text determination request, and obtaining data information carried by the message;

Obtain information for indicating a location from the data information as a storage location;

A to-be-detected text library is determined from the storage location, and any text is extracted from the to-be-detected text library as the to-be-detected text.
17. The storage medium of claim 15, wherein, when the feature vector to be detected is generated according to the plurality of word segments to be detected and the plurality of target word segments, the at least one computer-readable instruction is executed by the processor to implement the following steps:

The to-be-detected feature vector is generated according to the mapping relationship between the plurality of to-be-detected word segments and all of the word segments.
18. The storage medium according to claim 15, wherein, when determining the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text, the at least one computer-readable instruction is executed by the processor to implement the following steps:

Detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word is used to indicate a negative tone;

Determine the first tone of the text to be detected according to the first detection result, and determine the second tone of the target text according to the second detection result;

If the first tone is the same as the second tone, determining the polarity feature as a first value; or

If the first tone is different from the second tone, the polarity feature is determined as a second value.
The storage medium of claim 15, wherein, when the text to be detected is converted into a semantic vector to be detected, the at least one computer-readable instruction is executed by the processor to implement the following steps:

converting the text to be detected into a sequence of word vectors;

Use forward long short-term memory network to perform feature extraction on the word vector sequence to obtain a first feature vector;

Use the reverse long short-term memory network to perform feature extraction on the word vector sequence to obtain a second feature vector;

Splicing the first feature vector and the second feature vector to obtain the to-be-detected semantic vector.
The storage medium according to claim 15, wherein, when generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, the at least one computer-readable instruction is executed by the processor to implement the following steps:

Subtract the target semantic vector from the to-be-detected semantic vector to obtain a difference vector;

Splicing the semantic vector to be detected, the target semantic vector and the difference vector to obtain a splicing semantic vector;

The spliced semantic vector is iteratively mapped by using a pre-built multi-layer hidden layer to obtain the semantic feature.