WO2022156180A1 - Similar text determination method and related device - Google Patents

Similar text determination method and related device

Info

Publication number
WO2022156180A1
WO2022156180A1 PCT/CN2021/109391 CN2021109391W WO2022156180A1 WO 2022156180 A1 WO2022156180 A1 WO 2022156180A1 CN 2021109391 W CN2021109391 W CN 2021109391W WO 2022156180 A1 WO2022156180 A1 WO 2022156180A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
detected
target
word
vector
Prior art date
Application number
PCT/CN2021/109391
Other languages
French (fr)
Chinese (zh)
Inventor
李小娟
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022156180A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/194: Calculation of difference between files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a method for determining similar texts and related devices.
  • the similarity of sentences is determined by the co-occurrence information of the text. However, the similarity between two texts cannot be accurately calculated in this way, which reduces the accuracy of determining similar texts.
  • deep text similarity algorithms are generated.
  • sentences are mapped to semantics through the coding layer.
  • the inventor realized that if there are texts with similar text information but opposite meanings, the determination accuracy of similar texts will be low.
  • a first aspect of the present application provides a method for determining similar texts, the method for determining similar texts includes:
  • Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the multiple to-be-detected word segmentations and the multiple target word segmentations as co-occurring words;
  • a second aspect of the present application provides an electronic device, the electronic device includes a processor and a memory, the processor is configured to execute computer-readable instructions stored in the memory to implement the following steps:
  • Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the multiple to-be-detected word segmentations and the multiple target word segmentations as co-occurring words;
  • a third aspect of the present application provides a computer-readable storage medium on which at least one computer-readable instruction is stored, and the at least one computer-readable instruction is executed by a processor to implement the following steps:
  • Calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and determine the intersection of the multiple to-be-detected word segmentations and the multiple target word segmentations as co-occurring words;
  • a fourth aspect of the present application provides an apparatus for determining similar texts, and the apparatus for determining similar texts includes:
  • a determination unit configured to receive a similar text determination request, and determine the text to be detected according to the similar text determination request
  • a generating unit configured to perform word segmentation processing on the text to be detected to obtain a plurality of word segmentations to be detected, and perform word segmentation processing on the target text to obtain a plurality of target word segmentations
  • the generating unit is also used to obtain the union of the plurality of word segmentations to be detected and the plurality of target word segmentations to obtain all word segmentations;
  • the generating unit is further configured to generate a feature vector to be detected according to the plurality of word segmentations to be detected and the plurality of target word segmentations, and to generate a target feature vector according to the plurality of word segmentations to be detected and the plurality of target word segmentations;
  • the determining unit is further configured to calculate the similarity between the to-be-detected feature vector and the target feature vector to obtain the text similarity between the to-be-detected text and the target text, and to determine the intersection of the plurality of to-be-detected word segmentations and the plurality of target word segmentations as co-occurrence words;
  • the determining unit is also used to calculate the co-occurrence quantity of the co-occurrence words, and calculate the total amount of the word segmentation of all the word segmentations;
  • the determining unit is further configured to divide the co-occurrence number by the total amount of word segmentation to obtain a similarity coefficient
  • the determining unit is further configured to determine the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
  • the generating unit is further configured to generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature;
  • a conversion unit for converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector
  • the determining unit is further configured to generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and to determine the similarity result between the text to be detected and the target text according to the text feature and the semantic feature.
  • the present application determines the text similarity, the similarity coefficient and the polarity feature of the text to be detected and the target text. Because the polarity feature characterizes whether the tone of the text to be detected is the same as the tone of the target text, the degree of similarity between the two texts can be determined accurately, which solves the problem of low accuracy when texts have similar wording but opposite meanings. The similarity result between the text to be detected and the target text can then be accurately determined through the text feature and the semantic feature.
  • FIG. 1 is a flowchart of a preferred embodiment of the method for determining similar texts in the present application.
  • FIG. 2 is a flowchart of an embodiment of the present application for generating a feature vector to be detected.
  • FIG. 3 is a flow chart of an embodiment of the present application for generating semantic features.
  • FIG. 4 is a functional block diagram of a preferred embodiment of the apparatus for determining similar texts of the present application.
  • FIG. 5 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the method for determining similar texts in the present application.
  • FIG. 1 is a flowchart of a preferred embodiment of the method for determining similar texts of the present application. Depending on requirements, the order of the steps in this flowchart can be changed, and some steps can be omitted.
  • the similar text determination method is applied to one or more electronic devices. The electronic device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored computer-readable instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the electronic device can be any electronic product that can interact with the user, such as a personal computer, a tablet computer, a smart phone, a personal digital assistant (PDA), a game console, an Internet Protocol television (IPTV), smart wearable devices, etc.
  • the electronic equipment may include network equipment and/or user equipment.
  • the network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing (Cloud Computing).
  • the network where the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
  • S10 Receive a similar text determination request, and determine the text to be detected according to the similar text determination request.
  • the information carried in the similar text determination request includes, but is not limited to: target text, storage location, and the like.
  • the similar text determination request can be triggered by any user.
  • the text to be detected refers to the text that needs to be checked for similarity to the target text. There may be multiple texts to be detected.
  • the electronic device determining the text to be detected according to the similar text determination request includes:
  • a to-be-detected text library is determined from the storage location, and any text is extracted from the to-be-detected text library as the to-be-detected text.
  • the target text refers to the reference text in the similar text determination request.
  • the obtaining, by the electronic device, the target text from the similar text determination request includes:
  • Information for indicating text is acquired from the data information as the target text.
  • since the target text is carried in the similar text determination request, the target text can be quickly acquired from the data information obtained by parsing.
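The request handling described above can be sketched as follows. This is only an illustration under stated assumptions: the request is modelled as an already parsed mapping, and the field names `target_text` and `storage_location`, the `STORAGE` dictionary and the function `handle_request` are hypothetical names that do not come from the application.

```python
import random

# Hypothetical in-memory stand-in for the storage that holds to-be-detected text libraries.
STORAGE = {"library/faq": ["text one", "text two", "text three"]}


def handle_request(request, storage=STORAGE, rng=None):
    """Return (text_to_detect, target_text) for a parsed similar text determination request."""
    rng = rng or random.Random()
    # The target text is carried directly in the request (assumed field name).
    target_text = request["target_text"]
    # The storage location identifies the to-be-detected text library (assumed field name).
    library = storage[request["storage_location"]]
    # Any text in the library may be taken as the text to be detected.
    text_to_detect = rng.choice(library)
    return text_to_detect, target_text


request = {"target_text": "reference text", "storage_location": "library/faq"}
print(handle_request(request, rng=random.Random(0)))
```

Because there may be multiple texts to be detected, a caller would typically repeat the later steps for each text in the library rather than for a single random pick.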
  • FIG. 2 is a flowchart of an embodiment of the present application for generating a feature vector to be detected.
  • generating, by the electronic device, a feature vector to be detected according to the text to be detected and the target text includes:
  • S120 Perform word segmentation processing on the text to be detected to obtain multiple word segmentations to be detected, and perform word segmentation processing on the target text to obtain multiple target word segmentations.
  • the multiple word segments to be detected may be multiple words, and the multiple target word segments may be multiple words.
  • S121 Acquire the union of the multiple to-be-detected word segments and the multiple target word segments to obtain all the segmented words.
  • S122 Generate the feature vector to be detected according to the mapping relationship between the multiple word segments to be detected and all the word segments.
  • the mapping relationship refers to whether the multiple to-be-detected word segments exist in all the word segments.
  • for example, the multiple word segments to be detected are: I, Immediately, Immediately, Help, You, Apply, Please, Okay, Do
  • the multiple target word segments are: I, No, Do, Fa, Help, You, Apply, Please; therefore, all the word segments are: I, help, apply, please, immediately, immediately, you, ok, ?
  • the last four of all the word segments do not appear among the multiple word segments to be detected; therefore, the feature vector to be detected is [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
  • through the above embodiment, the feature vector to be detected can be determined according to the text to be detected and the target text. Since the feature vector to be detected is generated with reference to the target text, it can be determined accurately.
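Steps S120 to S122 can be sketched as below. This is a minimal illustration, assuming that word segmentation has already been performed and that the "mapping relationship" is a binary presence flag per word segment. The function name `build_feature_vectors` and the placeholder tokens are hypothetical; they are chosen only so that the counts match the example in the text (9 word segments to be detected, 8 target word segments, 13 word segments in total).

```python
def build_feature_vectors(detected_tokens, target_tokens):
    """Return (all_tokens, detected_vector, target_vector) as binary presence vectors."""
    # S121: union of both token lists, in first-seen order so the result is deterministic.
    all_tokens = list(dict.fromkeys(detected_tokens + target_tokens))
    detected_set, target_set = set(detected_tokens), set(target_tokens)
    # S122: 1 if the word segment occurs in the text, 0 otherwise.
    detected_vector = [1 if tok in detected_set else 0 for tok in all_tokens]
    target_vector = [1 if tok in target_set else 0 for tok in all_tokens]
    return all_tokens, detected_vector, target_vector


# Hypothetical placeholder tokens: 9 to be detected, 8 target, 4 shared.
detected = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
target = ["a", "b", "c", "d", "w", "x", "y", "z"]
all_tokens, x, y = build_feature_vectors(detected, target)
print(x)  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(y)  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Both vectors share the dimension of the union, which is what allows them to be compared component by component in the next step.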
  • the electronic device generating a target feature vector according to the text to be detected and the target text includes:
  • the target feature vector is generated according to the mapping relationship between the target word segment and all word segments.
  • S13 Calculate the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determine a similarity coefficient according to the text to be detected and the target text.
  • the electronic device uses the cosine similarity calculation formula to calculate the similarity between the feature vector to be detected and the target feature vector:
  • $\cos\theta = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
  • $\cos\theta$ refers to the similarity between the feature vector to be detected and the target feature vector
  • $n$ refers to the vector dimension of the feature vector to be detected and the target feature vector
  • $i$ refers to the current vector dimension
  • $x_i$ refers to the $i$-th component of the feature vector to be detected
  • $y_i$ refers to the $i$-th component of the target feature vector.
  • the text similarity can be quickly determined through the cosine similarity calculation formula.
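The cosine similarity calculation can be sketched in a few lines. The vector `x` is the feature vector to be detected from the example above; the target vector `y` is not given in the text and is an assumption here (8 ones, 4 of them shared with `x`), chosen because it reproduces the text similarity of 0.4714 quoted later in the description. Returning 0.0 for an all-zero vector is also an assumption.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_x * norm_y)


x = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # feature vector to be detected (from the text)
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # assumed target feature vector
print(round(cosine_similarity(x, y), 4))  # 0.4714
```

With binary vectors the numerator is simply the number of shared word segments (4), and the denominator is the square root of 9 times 8.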
  • the electronic device determining the similarity coefficient according to the text to be detected and the target text includes:
  • the similarity coefficient is obtained by dividing the co-occurrence number by the total number of word segmentations.
  • for example, the co-occurrence words are: I, Help, Apply, Please
  • the co-occurrence number of the co-occurrence words is calculated to be 4
  • the total number of word segments among all the word segments is calculated to be 13, so the similarity coefficient is 4/13 ≈ 0.3077.
  • the similarity coefficient can be accurately determined according to the co-occurrence words of the text to be detected and the target text.
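The similarity coefficient is the size of the intersection divided by the size of the union, which is the Jaccard index of the two sets of word segments. A minimal sketch follows; the function name `similarity_coefficient` and the placeholder tokens are hypothetical, and treating the word segments as sets (so repeated segments count once) is an assumption.

```python
def similarity_coefficient(detected_tokens, target_tokens):
    """Number of co-occurrence words divided by the number of all word segments."""
    detected_set, target_set = set(detected_tokens), set(target_tokens)
    co_occurrence = detected_set & target_set  # intersection
    all_tokens = detected_set | target_set     # union
    if not all_tokens:
        return 0.0  # assumption: two empty texts yield a coefficient of 0
    return len(co_occurrence) / len(all_tokens)


# Hypothetical placeholder tokens: 4 shared segments, 13 segments in total.
detected = ["a", "b", "c", "d", "e", "f", "g", "h", "i"]
target = ["a", "b", "c", "d", "w", "x", "y", "z"]
print(round(similarity_coefficient(detected, target), 4))  # 0.3077
```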
  • the polarity feature is 1 or 0.
  • when the tone of the text to be detected is the same as the tone of the target text, the polarity feature is determined to be 1; when the tone of the text to be detected is different from the tone of the target text, the polarity feature is determined to be 0.
  • the electronic device determining the polarity characteristics of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text includes:
  • Detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word is used to indicate a negative tone ;
  • if the first tone is the same as the second tone, the polarity feature is determined as a first value
  • if the first tone is different from the second tone, the polarity feature is determined as a second value.
  • the preset words include, but are not limited to: none, none, no.
  • the tone of the text to be detected and the target text can be accurately determined according to the preset words, and then the polarity feature can be accurately determined.
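The polarity check can be sketched as follows. This is an illustration only: the English negation words in `PRESET_WORDS` are hypothetical stand-ins for the preset words of the application, the check is done on word segments rather than raw text, and the first and second values are taken to be 1 and 0 as in the description above.

```python
# Hypothetical preset words indicating a negative tone.
PRESET_WORDS = {"no", "not", "never", "cannot"}


def has_negative_tone(tokens, preset_words=PRESET_WORDS):
    """Detection result: True if any preset word occurs among the word segments."""
    return any(tok.lower() in preset_words for tok in tokens)


def polarity_feature(detected_tokens, target_tokens, preset_words=PRESET_WORDS):
    """Return 1 when both texts have the same tone, 0 when the tones differ."""
    first_tone = has_negative_tone(detected_tokens, preset_words)
    second_tone = has_negative_tone(target_tokens, preset_words)
    return 1 if first_tone == second_tone else 0


print(polarity_feature(["i", "will", "apply"], ["i", "cannot", "apply"]))  # 0
print(polarity_feature(["i", "will", "apply"], ["i", "can", "apply"]))     # 1
```

A word-list check of this kind is deliberately simple; it does not handle double negation or negation expressed without one of the preset words.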
  • the text feature is obtained by splicing the text similarity, the similarity coefficient and the polarity feature.
  • the text similarity is 0.4714
  • the similarity coefficient is 0.3077
  • the polarity feature is 0.
  • the text feature obtained is [0.4714, 0.3077, 0].
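Splicing here amounts to concatenating the three scalar values into one vector. A minimal sketch, with the hypothetical function name `build_text_feature`:

```python
def build_text_feature(text_similarity, similarity_coefficient, polarity_feature):
    """Splice the three scalar features into a single text feature vector."""
    return [text_similarity, similarity_coefficient, polarity_feature]


print(build_text_feature(0.4714, 0.3077, 0))  # [0.4714, 0.3077, 0]
```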
  • S16 Convert the text to be detected into a semantic vector to be detected, and convert the target text into a target semantic vector.
  • the semantic vector to be detected includes the semantics of the text to be detected
  • the target semantic vector includes the semantics of the target text
  • the electronic device converting the text to be detected into a semantic vector to be detected includes:
  • the generated semantic vector to be detected can have the contextual semantics of the text to be detected, and the accuracy of determination of the semantic vector to be detected can be improved.
  • the similarity result is either that the text to be detected is similar to the target text or that the text to be detected is not similar to the target text.
  • FIG. 3 is a flowchart of an embodiment of generating semantic features of the present application.
  • the electronic device generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector includes:
  • S172 Perform iterative mapping on the spliced semantic vector by using a pre-built multi-layer hidden layer to obtain the semantic feature.
  • since the semantic feature is obtained by operating on the semantic vector to be detected and the target semantic vector, the semantic feature can contain the semantics of both the text to be detected and the target text, which improves the accuracy of the semantic feature.
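The mapping of step S172 can be sketched as a small feed-forward pass. Everything beyond "splice the two semantic vectors and pass them through several hidden layers" is an assumption: the splice is taken to be plain concatenation, the activation is ReLU, and the layer weights are random placeholders rather than pre-built parameters. The names `semantic_feature`, `dense`, `relu` and `random_layer` are hypothetical.

```python
import random


def relu(values):
    """Element-wise rectified linear activation (assumed activation function)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def semantic_feature(detected_vec, target_vec, layers):
    """Splice the two semantic vectors and map them through each hidden layer in turn."""
    hidden = list(detected_vec) + list(target_vec)  # spliced semantic vector
    for weights, biases in layers:
        hidden = relu(dense(hidden, weights, biases))
    return hidden


def random_layer(n_in, n_out, rng):
    """Placeholder parameters; a real system would load pre-built weights."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
layers = [random_layer(6, 4, rng), random_layer(4, 2, rng)]
feature = semantic_feature([0.2, -0.1, 0.5], [0.3, 0.0, -0.4], layers)
print(len(feature))  # 2
```

The output dimension is set by the last layer, so the semantic feature can be made the same order of size as the three-element text feature before the two are combined.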
  • determining, by the electronic device, according to the text feature and the semantic feature, the similarity result between the text to be detected and the target text includes:
  • the present application determines the text similarity, the similarity coefficient and the polarity feature of the text to be detected and the target text. Because the polarity feature characterizes whether the tone of the text to be detected is the same as the tone of the target text, the degree of similarity between the two texts can be determined accurately, which solves the problem of low accuracy when texts have similar wording but opposite meanings. The similarity result between the text to be detected and the target text can then be accurately determined through the text feature and the semantic feature.
  • the similar text determination device 11 includes a determination unit 110, an acquisition unit 111, a generation unit 112 and a conversion unit 113.
  • the module/unit referred to in this application refers to a series of computer-readable instruction segments that can be acquired by the processor 13 and can perform fixed functions, and are stored in the memory 12 . In this embodiment, the functions of each module/unit will be described in detail in subsequent embodiments.
  • the determination unit 110 receives the similar text determination request, and determines the text to be detected according to the similar text determination request.
  • the information carried in the similar text determination request includes, but is not limited to: target text, storage location, and the like.
  • the similar text determination request can be triggered by any user.
  • the text to be detected refers to the text that needs to be checked for similarity to the target text. There may be multiple texts to be detected.
  • the determining unit 110 determining the text to be detected according to the similar text determination request includes:
  • a to-be-detected text library is determined from the storage location, and any text is extracted from the to-be-detected text library as the to-be-detected text.
  • the obtaining unit 111 obtains the target text from the similar text determination request.
  • the target text refers to the reference text in the similar text determination request.
  • the acquiring unit 111 acquiring the target text from the similar text determination request includes:
  • Information for indicating text is acquired from the data information as the target text.
  • since the target text is carried in the similar text determination request, the target text can be quickly acquired from the data information obtained by parsing.
  • the generating unit 112 generates a feature vector to be detected according to the text to be detected and the target text, and generates a target feature vector according to the text to be detected and the target text.
  • the generating unit 112 generates a feature vector to be detected according to the text to be detected and the target text, including:
  • the multiple word segments to be detected may be multiple words, and the multiple target word segments may be multiple words.
  • the to-be-detected feature vector is generated according to the mapping relationship between the plurality of to-be-detected word segments and all of the word segments.
  • the mapping relationship refers to whether the multiple to-be-detected word segments exist in all the word segments.
  • for example, the multiple word segments to be detected are: I, Immediately, Immediately, Help, You, Apply, Please, Okay, Do
  • the multiple target word segments are: I, No, Do, Fa, Help, You, Apply, Please; therefore, all the word segments are: I, help, apply, please, immediately, immediately, you, ok, ?
  • the last four of all the word segments do not appear among the multiple word segments to be detected; therefore, the feature vector to be detected is [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
  • through the above embodiment, the feature vector to be detected can be determined according to the text to be detected and the target text. Since the feature vector to be detected is generated with reference to the target text, it can be determined accurately.
  • the generating unit 112 generates a target feature vector according to the text to be detected and the target text, including:
  • the target feature vector is generated according to the mapping relationship between the target word segment and all word segments.
  • the determining unit 110 calculates the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determines the similarity coefficient according to the text to be detected and the target text.
  • the determining unit 110 uses the cosine similarity calculation formula to calculate the similarity between the feature vector to be detected and the target feature vector:
  • $\cos\theta = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
  • $\cos\theta$ refers to the similarity between the feature vector to be detected and the target feature vector
  • $n$ refers to the vector dimension of the feature vector to be detected and the target feature vector
  • $i$ refers to the current vector dimension
  • $x_i$ refers to the $i$-th component of the feature vector to be detected
  • $y_i$ refers to the $i$-th component of the target feature vector.
  • the text similarity can be quickly determined through the cosine similarity calculation formula.
  • the determining unit 110 determining the similarity coefficient according to the text to be detected and the target text includes:
  • the similarity coefficient is obtained by dividing the co-occurrence number by the total number of word segmentations.
  • for example, the co-occurrence words are: I, Help, Apply, Please
  • the co-occurrence number of the co-occurrence words is calculated to be 4
  • the total number of word segments among all the word segments is calculated to be 13, so the similarity coefficient is 4/13 ≈ 0.3077.
  • the similarity coefficient can be accurately determined according to the co-occurrence words of the text to be detected and the target text.
  • the determining unit 110 determines the polarity characteristics of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text.
  • the polarity feature is 1 or 0.
  • when the tone of the text to be detected is the same as the tone of the target text, the polarity feature is determined to be 1; when the tone of the text to be detected is different from the tone of the target text, the polarity feature is determined to be 0.
  • the determining unit 110 determines the polarity features of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text, including:
  • Detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word is used to indicate a negative tone ;
  • if the first tone is the same as the second tone, the polarity feature is determined as a first value
  • if the first tone is different from the second tone, the polarity feature is determined as a second value.
  • the preset words include, but are not limited to: none, none, no.
  • the tone of the text to be detected and the target text can be accurately determined according to the preset words, and then the polarity feature can be accurately determined.
  • the generating unit 112 generates text features of the text to be detected and the target text according to the text similarity, the similarity coefficient and the polarity feature.
  • the text feature is obtained by splicing the text similarity, the similarity coefficient and the polarity feature.
  • the text similarity is 0.4714
  • the similarity coefficient is 0.3077
  • the polarity feature is 0.
  • the text feature obtained is [0.4714, 0.3077, 0].
  • the converting unit 113 converts the text to be detected into a semantic vector to be detected, and converts the target text into a target semantic vector.
  • the semantic vector to be detected includes the semantics of the text to be detected
  • the target semantic vector includes the semantics of the target text
  • the converting unit 113 converts the text to be detected into a semantic vector to be detected including:
  • the generated semantic vector to be detected can have the contextual semantics of the text to be detected, and the accuracy of determination of the semantic vector to be detected can be improved.
  • the determining unit 110 generates the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determines the similarity result between the text to be detected and the target text according to the text feature and the semantic feature.
  • the similarity result is either that the text to be detected is similar to the target text or that the text to be detected is not similar to the target text.
  • the determining unit 110 generates the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, including:
  • the spliced semantic vector is iteratively mapped by using a pre-built multi-layer hidden layer to obtain the semantic feature.
  • since the semantic feature is obtained by operating on the semantic vector to be detected and the target semantic vector, the semantic feature can contain the semantics of both the text to be detected and the target text, which improves the accuracy of the semantic feature.
  • the determining unit 110 determines the similarity result between the text to be detected and the target text according to the text feature and the semantic feature, including:
  • the present application determines the text similarity, the similarity coefficient and the polarity feature of the text to be detected and the target text. Because the polarity feature characterizes whether the tone of the text to be detected is the same as the tone of the target text, the degree of similarity between the two texts can be determined accurately, which solves the problem of low accuracy when texts have similar wording but opposite meanings. The similarity result between the text to be detected and the target text can then be accurately determined through the text feature and the semantic feature.
  • FIG. 5 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the method for determining similar texts of the present application.
  • the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer-readable instructions stored in the memory 12 and executable on the processor 13, such as a similar text determination program.
  • the schematic diagram is only an example of the electronic device 1 and does not constitute a limitation on the electronic device 1, which may include more or fewer components than shown, combine some components, or use different components; for example, the electronic device 1 may also include input and output devices, network access devices, buses, and the like.
  • the processor 13 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc.
  • the processor 13 is the computing core and control center of the electronic device 1. It uses various interfaces and lines to connect all parts of the electronic device 1, and runs the operating system of the electronic device 1 as well as the installed applications, program code, and the like.
  • the computer-readable instructions may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 12 and executed by the processor 13 to implement the present application.
  • the one or more modules/units may be a series of computer-readable instruction segments capable of accomplishing specific functions, and the computer-readable instruction segments are used to describe the execution process of the computer-readable instructions in the electronic device 1.
  • the computer-readable instructions may be divided into a determining unit 110, an obtaining unit 111, a generating unit 112 and a converting unit 113.
  • the memory 12 can be used to store the computer-readable instructions and/or modules. The processor 13 implements the various functions of the electronic device 1 by running or executing the computer-readable instructions and/or modules stored in the memory 12 and by invoking the data stored in the memory 12.
  • the memory 12 may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the stored data area may store data created according to the use of the electronic device, and the like.
  • the memory 12 may include non-volatile and volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one disk storage device, a flash memory device, or another storage device.
  • the memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory stick, a TF card (Trans-flash Card), and the like.
  • if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile storage medium or a volatile storage medium.
  • the present application may implement all or part of the processes of the methods in the above embodiments by instructing the relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium.
  • the computer-readable instructions when executed by the processor, can implement the steps of the above-mentioned method embodiments.
  • the computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, some intermediate form, and the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), and a random access memory (RAM, Random Access Memory).
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the memory 12 in the electronic device 1 stores computer-readable instructions to implement a method for determining similar text, and the processor 13 can execute the computer-readable instructions to implement:
  • calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining a similarity coefficient according to the text to be detected and the target text;
  • the computer-readable storage medium stores computer-readable instructions, wherein the computer-readable instructions, when executed by the processor 13, are used to implement the following steps:
  • calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining a similarity coefficient according to the text to be detected and the target text;
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to artificial intelligence, and provides a similar text determination method and a related device. The method comprises: determining a text to be detected and a target text; generating a feature vector to be detected and a target feature vector; calculating the similarity between the feature vector to be detected and the target feature vector to obtain a text similarity; determining a similarity coefficient and a polarity feature; generating text features according to the text similarity, the similarity coefficient, and the polarity feature; converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector; generating semantic features of the text to be detected and the target text; and determining a similarity result according to the text features and the semantic features. The present application can improve the accuracy of determining similar texts. In addition, the present application further relates to blockchain technology, and the similarity result may be stored in a blockchain.

Description

Similar text determination method and related device
This application claims priority to Chinese patent application No. 202110071000.0, entitled "Similar text determination method and related device", filed with the China Patent Office on January 19, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a similar text determination method and related device.
Background
At present, traditional unsupervised text similarity algorithms determine the similarity of sentences from the co-occurrence information of their words. However, if the texts contain words that share a form but differ in meaning, or words that differ in form but share a meaning, the similarity between two texts cannot be calculated accurately, which lowers the accuracy of similar text determination. Deep text similarity algorithms emerged to overcome this defect: they map sentences into a semantic space through an encoding layer and then calculate the similarity of the texts. However, the inventor realized that when two texts contain similar textual information but have opposite meanings, the accuracy of similar text determination is still low.
Summary of the Invention
In view of the above, it is necessary to provide a similar text determination method and related device that can improve the accuracy of similar text determination.
A first aspect of the present application provides a similar text determination method, the similar text determination method comprising:
receiving a similar text determination request, and determining a text to be detected according to the similar text determination request;
obtaining a target text from the similar text determination request;
performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments;
obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generating a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
calculating the co-occurrence number of the co-occurring words, and calculating the total number of all the word segments;
dividing the co-occurrence number by the total number of word segments to obtain a similarity coefficient;
determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result of the text to be detected and the target text according to the text features and the semantic features.
A second aspect of the present application provides an electronic device comprising a processor and a memory, the processor being configured to execute computer-readable instructions stored in the memory to implement the following steps:
receiving a similar text determination request, and determining a text to be detected according to the similar text determination request;
obtaining a target text from the similar text determination request;
performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments;
obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generating a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
calculating the co-occurrence number of the co-occurring words, and calculating the total number of all the word segments;
dividing the co-occurrence number by the total number of word segments to obtain a similarity coefficient;
determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result of the text to be detected and the target text according to the text features and the semantic features.
A third aspect of the present application provides a computer-readable storage medium storing at least one computer-readable instruction, the at least one computer-readable instruction being executed by a processor to implement the following steps:
receiving a similar text determination request, and determining a text to be detected according to the similar text determination request;
obtaining a target text from the similar text determination request;
performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments;
obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generating a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
calculating the co-occurrence number of the co-occurring words, and calculating the total number of all the word segments;
dividing the co-occurrence number by the total number of word segments to obtain a similarity coefficient;
determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result of the text to be detected and the target text according to the text features and the semantic features.
A fourth aspect of the present application provides a similar text determination apparatus, the similar text determination apparatus comprising:
a determining unit, configured to receive a similar text determination request, and determine a text to be detected according to the similar text determination request;
an obtaining unit, configured to obtain a target text from the similar text determination request;
a generating unit, configured to perform word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and perform word segmentation on the target text to obtain a plurality of target word segments;
the generating unit being further configured to obtain the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
the generating unit being further configured to generate a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generate a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
the determining unit being further configured to calculate the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determine the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
the determining unit being further configured to calculate the co-occurrence number of the co-occurring words, and calculate the total number of all the word segments;
the determining unit being further configured to divide the co-occurrence number by the total number of word segments to obtain a similarity coefficient;
the determining unit being further configured to determine a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
the generating unit being further configured to generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
a converting unit, configured to convert the text to be detected into a semantic vector to be detected, and convert the target text into a target semantic vector;
the determining unit being further configured to generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determine a similarity result of the text to be detected and the target text according to the text features and the semantic features.
As can be seen from the above technical solutions, the present application determines the text similarity, the similarity coefficient, and the polarity feature of the text to be detected and the target text. Because the polarity feature indicates whether the tone of the text to be detected is the same as that of the target text, the degree of similarity between the two texts can be determined accurately. Determining the semantic features avoids the loss of accuracy caused by words that share a form but differ in meaning, or that differ in form but share a meaning. The text features and the semantic features together therefore allow the similarity result of the text to be detected and the target text to be determined accurately.
Description of Drawings
FIG. 1 is a flowchart of a preferred embodiment of the similar text determination method of the present application.
FIG. 2 is a flowchart of an embodiment of generating the feature vector to be detected in the present application.
FIG. 3 is a flowchart of an embodiment of generating the semantic features in the present application.
FIG. 4 is a functional block diagram of a preferred embodiment of the similar text determination apparatus of the present application.
FIG. 5 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the similar text determination method of the present application.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in detail below with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a preferred embodiment of the similar text determination method of the present application. Depending on requirements, the order of the steps in the flowchart may be changed, and some steps may be omitted.
The similar text determination method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored computer-readable instructions; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The electronic device may be any electronic product capable of human-computer interaction with a user, for example, a personal computer, a tablet computer, a smartphone, a personal digital assistant (PDA), a game console, an Internet Protocol Television (IPTV), a smart wearable device, and the like.
The electronic device may include a network device and/or user equipment. The network device includes, but is not limited to, a single network electronic device, an electronic device group composed of multiple network electronic devices, or a cloud composed of a large number of hosts or network electronic devices based on cloud computing.
The network in which the electronic device is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
S10: receiving a similar text determination request, and determining a text to be detected according to the similar text determination request.
In at least one embodiment of the present application, the information carried in the similar text determination request includes, but is not limited to, the target text, a storage location, and the like. The similar text determination request may be triggered by any user.
The text to be detected is a text that needs to be checked for similarity to the target text. There may be multiple texts to be detected.
In at least one embodiment of the present application, the electronic device determining the text to be detected according to the similar text determination request includes:
parsing the message of the similar text determination request to obtain the data information carried by the message;
obtaining, from the data information, the information indicating a location as the storage location;
determining a library of texts to be detected from the storage location, and extracting any text from the library as the text to be detected.
With the above implementation, because the entire similar text determination request does not need to be parsed, the storage location can be obtained more efficiently, and the text to be detected can therefore be obtained quickly.
S11: obtaining a target text from the similar text determination request.
In at least one embodiment of the present application, the target text is the reference text in the similar text determination request.
In at least one embodiment of the present application, the electronic device obtaining the target text from the similar text determination request includes:
obtaining, from the data information, the information indicating a text as the target text.
With the above implementation, because the target text is stored in the similar text determination request, the target text can be obtained quickly from the parsed data information.
S12: generating a feature vector to be detected according to the text to be detected and the target text, and generating a target feature vector according to the text to be detected and the target text.
Referring to FIG. 2, FIG. 2 is a flowchart of an embodiment of generating the feature vector to be detected in the present application. In at least one embodiment of the present application, the electronic device generating the feature vector to be detected according to the text to be detected and the target text includes:
S120: performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments.
The plurality of word segments to be detected may be multiple characters or words, and the plurality of target word segments may be multiple characters or words.
S121: obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments.
S122: generating the feature vector to be detected according to the mapping relationship between the plurality of word segments to be detected and all the word segments.
The mapping relationship indicates, for each of the word segments in the union, whether it is present among the plurality of word segments to be detected.
For example, the plurality of word segments to be detected are 我, 立, 即, 帮, 你, 申, 请, 好, 吗 (the characters of "我立即帮你申请好吗", roughly "Shall I apply for you right away?"), and the plurality of target word segments are 我, 没, 办, 法, 帮, 您, 申, 请 (the characters of "我没办法帮您申请", roughly "I am unable to apply for you"). All the word segments are therefore 我, 帮, 申, 请, 立, 即, 你, 好, 吗, 没, 办, 法, 您. Because 没, 办, 法, and 您 do not appear among the word segments to be detected, the feature vector to be detected is [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
With the above implementation, the feature vector to be detected can be determined from the text to be detected and the target text. Because the feature vector to be detected is generated with reference to the target text, it can be determined accurately.
In at least one embodiment of the present application, the electronic device generating the target feature vector according to the text to be detected and the target text includes:
generating the target feature vector according to the mapping relationship between the plurality of target word segments and all the word segments.
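Steps S120 to S122 can be illustrated with a minimal Python sketch. It assumes the texts have already been segmented into tokens; the function names `build_vocabulary` and `feature_vector` are hypothetical, and the first-seen ordering of the union is an assumption, since the source does not fix the order of the dimensions.

```python
def build_vocabulary(detected_tokens, target_tokens):
    """Union of both token lists, kept in first-seen order (ordering is an assumption)."""
    vocab = []
    seen = set()
    for tok in list(detected_tokens) + list(target_tokens):
        if tok not in seen:
            seen.add(tok)
            vocab.append(tok)
    return vocab


def feature_vector(tokens, vocab):
    """1 where the vocabulary entry occurs among the tokens, 0 otherwise."""
    token_set = set(tokens)
    return [1 if tok in token_set else 0 for tok in vocab]


# Tokens taken from the example in the description.
detected = ["我", "立", "即", "帮", "你", "申", "请", "好", "吗"]
target = ["我", "没", "办", "法", "帮", "您", "申", "请"]

vocab = build_vocabulary(detected, target)
detected_vec = feature_vector(detected, vocab)
target_vec = feature_vector(target, vocab)
```

Because the ordering differs from the listing in the description, the positions of the ones in `target_vec` differ as well; any quantity computed over matching dimensions, such as a dot product, is unaffected by a consistent reordering.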
S13: calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining a similarity coefficient according to the text to be detected and the target text.
In at least one embodiment of the present application, the electronic device uses the cosine similarity formula to calculate the similarity between the feature vector to be detected and the target feature vector.
The cosine similarity formula is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
where cos θ is the similarity between the feature vector to be detected and the target feature vector, n is the dimension of the two vectors, i is the index of the current dimension, x_i is the i-th component of the feature vector to be detected, and y_i is the i-th component of the target feature vector.
The text similarity can be determined quickly with the cosine similarity formula.
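As an illustration of the cosine similarity calculation, the following Python sketch applies it to the vectors of the earlier example. The function name `cosine_similarity` is hypothetical, the target vector is derived here from the example's word segments rather than stated in the source, and returning 0.0 for a zero vector is an assumption because the source does not define that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: the source does not cover empty texts
    return dot / (norm_x * norm_y)


# Dimensions follow the order 我, 帮, 申, 请, 立, 即, 你, 好, 吗, 没, 办, 法, 您.
detected_vec = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
target_vec = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # derived, not given in the source

text_similarity = cosine_similarity(detected_vec, target_vec)
```

For these vectors the dot product is 4 and the norms are √9 and √8, so the text similarity is 4 / (3·√8), approximately 0.471.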
In at least one embodiment of the present application, determining, by the electronic device, the similarity coefficient according to the text to be detected and the target text includes:
determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as the co-occurring words;
counting the co-occurring words to obtain the co-occurrence count, and counting all the word segments to obtain the total number of word segments;
dividing the co-occurrence count by the total number of word segments to obtain the similarity coefficient.
Continuing the above example, the co-occurring words are 我, 帮, 申, and 请, so the co-occurrence count is 4. The total number of word segments is 13, which gives a similarity coefficient of 4/13 ≈ 0.3077.
Through the above implementation, the similarity coefficient can be determined accurately from the words that co-occur in the text to be detected and the target text.
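The similarity coefficient described above is the size of the intersection divided by the size of the union of the two sets of word segments. A minimal sketch follows; the function name is illustrative, and returning 0 for two empty inputs is an assumption.

```python
def similarity_coefficient(detected_tokens, target_tokens):
    """Co-occurrence count divided by the total number of distinct word segments."""
    detected_set = set(detected_tokens)
    target_set = set(target_tokens)
    co_occurring = detected_set & target_set  # intersection: co-occurring words
    all_tokens = detected_set | target_set    # union: all word segments
    if not all_tokens:
        return 0.0  # assumption: two empty texts yield a coefficient of 0
    return len(co_occurring) / len(all_tokens)


detected_tokens = ["我", "立", "即", "帮", "你", "申", "请", "好", "吗"]
target_tokens = ["我", "没", "办", "法", "帮", "您", "申", "请"]
print(round(similarity_coefficient(detected_tokens, target_tokens), 4))  # 0.3077
```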
S14: Determine the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text.
In at least one embodiment of the present application, the polarity feature is either 1 or 0. When the tone of the text to be detected is the same as the tone of the target text, the polarity feature is set to 1; when the two tones differ, the polarity feature is set to 0.
In at least one embodiment of the present application, determining, by the electronic device, the polarity feature of the text to be detected and the target text according to their tones includes:
detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word indicates a negative tone;
determining a first tone of the text to be detected according to the first detection result, and determining a second tone of the target text according to the second detection result;
if the first tone is the same as the second tone, setting the polarity feature to a first value; or
if the first tone differs from the second tone, setting the polarity feature to a second value.
The preset words include, but are not limited to: 无 ("none"), 没有 ("not have"), and 不 ("not").
Through the above implementation, the tones of the text to be detected and the target text can be determined accurately from the preset words, and the polarity feature can in turn be determined accurately.
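The polarity check can be sketched as a substring test against a list of negation words. In the sketch below, the two example sentences are reconstructed by joining the word segments of the running example, and the entry 没 is added to the preset list as an assumption (the list is stated to be non-exhaustive) so that the example yields the polarity value 0 used later in the text.

```python
# "无", "没有" and "不" come from the description; "没" is an added assumption.
NEGATION_WORDS = ("无", "没有", "不", "没")


def is_negative(text, negation_words=NEGATION_WORDS):
    """Detection result: True if the text contains any preset negation word."""
    return any(word in text for word in negation_words)


def polarity_feature(detected_text, target_text, same_value=1, different_value=0):
    """Return same_value when both texts share a tone, otherwise different_value."""
    first_tone = is_negative(detected_text)
    second_tone = is_negative(target_text)
    return same_value if first_tone == second_tone else different_value


print(polarity_feature("我立即帮你申请好吗", "我没办法帮您申请"))  # 0
```

A plain substring test is only a rough proxy for tone; for instance, it cannot tell a double negative from a single one.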
S15: Generate the text feature of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature.
In at least one embodiment of the present application, the text feature is obtained by concatenating the text similarity, the similarity coefficient, and the polarity feature.
For example, if the text similarity is 0.4714, the similarity coefficient is 0.3077, and the polarity feature is 0, concatenation yields the text feature [0.4714, 0.3077, 0].
S16: Convert the text to be detected into a semantic vector to be detected, and convert the target text into a target semantic vector.
In at least one embodiment of the present application, the semantic vector to be detected carries the semantics of the text to be detected, and the target semantic vector carries the semantics of the target text.
In at least one embodiment of the present application, converting, by the electronic device, the text to be detected into the semantic vector to be detected includes:
converting the text to be detected into a sequence of character vectors;
performing feature extraction on the character vector sequence with a forward long short-term memory (LSTM) network to obtain a first feature vector;
performing feature extraction on the character vector sequence with a backward LSTM network to obtain a second feature vector;
concatenating the first feature vector and the second feature vector to obtain the semantic vector to be detected.
Through the above implementation, the generated semantic vector to be detected captures the context of the text to be detected in both directions, which improves the accuracy of the semantic vector.
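The bidirectional encoding steps above can be illustrated with a small, dependency-free LSTM. This is a sketch under several assumptions that the description does not state: the weights are random rather than trained, each direction contributes its final hidden state as the "feature vector", and the input and hidden sizes are arbitrary.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def init_lstm(input_size, hidden_size, rng):
    """Random (untrained) weights and zero biases for the i, f, g, o gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in "ifgo"}


def affine(weights, bias, vec):
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_last_hidden(params, sequence, hidden_size):
    """Run a standard LSTM over the sequence and return the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        z = list(x) + h  # current input concatenated with previous hidden state
        i = [sigmoid(v) for v in affine(*params["i"], z)]    # input gate
        f = [sigmoid(v) for v in affine(*params["f"], z)]    # forget gate
        g = [math.tanh(v) for v in affine(*params["g"], z)]  # candidate cell state
        o = [sigmoid(v) for v in affine(*params["o"], z)]    # output gate
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def bilstm_semantic_vector(char_vectors, forward, backward, hidden_size):
    """Concatenate the forward and backward feature vectors."""
    first = lstm_last_hidden(forward, char_vectors, hidden_size)
    second = lstm_last_hidden(backward, list(reversed(char_vectors)), hidden_size)
    return first + second


rng = random.Random(0)
input_size, hidden_size = 4, 3
forward_params = init_lstm(input_size, hidden_size, rng)
backward_params = init_lstm(input_size, hidden_size, rng)
# Stand-in character vectors for a five-character text.
char_vectors = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]
semantic_vector = bilstm_semantic_vector(char_vectors, forward_params, backward_params, hidden_size)
print(len(semantic_vector))  # 6
```

In practice a trained bidirectional LSTM from a deep learning framework would take the place of this hand-written cell.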
S17: Generate the semantic feature of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determine the similarity result of the text to be detected and the target text according to the text feature and the semantic feature.
It should be emphasized that, to further ensure the privacy and security of the similarity result, the similarity result may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the similarity result is either that the text to be detected is similar to the target text or that the text to be detected is not similar to the target text.
Referring to FIG. 3, FIG. 3 is a flowchart of an embodiment of generating the semantic feature in the present application. In at least one embodiment of the present application, generating, by the electronic device, the semantic feature of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector includes:
S170: Subtract the target semantic vector from the semantic vector to be detected to obtain a difference vector.
S171: Concatenate the semantic vector to be detected, the target semantic vector, and the difference vector to obtain a concatenated semantic vector.
S172: Map the concatenated semantic vector iteratively through pre-built hidden layers to obtain the semantic feature.
Through the above implementation, because the semantic feature is computed from both the semantic vector to be detected and the target semantic vector, it carries the semantics of both the text to be detected and the target text, which improves the accuracy of the semantic feature.
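Steps S170 to S172 can be sketched as follows. The layer sizes, the constant weights, and the tanh activation are illustrative assumptions; the description only states that pre-built hidden layers map the concatenated vector.

```python
import math


def splice_with_difference(detected_vec, target_vec):
    """S170 and S171: difference vector, then [detected, target, difference]."""
    if len(detected_vec) != len(target_vec):
        raise ValueError("semantic vectors must have the same dimension")
    diff = [d - t for d, t in zip(detected_vec, target_vec)]
    return list(detected_vec) + list(target_vec) + diff


def dense(vec, weights, bias):
    """One hidden layer: affine map followed by tanh (activation is an assumption)."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def semantic_feature(detected_vec, target_vec, layers):
    """S172: pass the spliced vector through each hidden layer in turn."""
    out = splice_with_difference(detected_vec, target_vec)
    for weights, bias in layers:
        out = dense(out, weights, bias)
    return out


# Hypothetical layers mapping 6 -> 3 -> 2 dimensions with constant weights.
layers = [
    ([[0.1] * 6 for _ in range(3)], [0.0] * 3),
    ([[0.5] * 3 for _ in range(2)], [0.0] * 2),
]
feature = semantic_feature([0.2, -0.1], [0.1, 0.3], layers)
print(len(feature))  # 2
```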
In at least one embodiment of the present application, determining, by the electronic device, the similarity result of the text to be detected and the target text according to the text feature and the semantic feature includes:
concatenating the text feature and the semantic feature to obtain a target vector;
inputting the target vector into a pre-built binary classification network to obtain the similarity result.
Through the above implementation, because the similarity result is determined from both the text feature and the semantic feature, it can be determined accurately.
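The final classification step can be sketched with a single logistic unit standing in for the binary classification network. The weights, bias, threshold, and semantic feature values below are hypothetical; only the text feature [0.4714, 0.3077, 0] comes from the example in the description.

```python
import math


def classify_similarity(text_feature, semantic_feature, weights, bias, threshold=0.5):
    """Concatenate both features and apply a logistic unit (a stand-in classifier)."""
    target_vector = list(text_feature) + list(semantic_feature)
    if len(weights) != len(target_vector):
        raise ValueError("one weight is required per target-vector component")
    logit = sum(w * v for w, v in zip(weights, target_vector)) + bias
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = "similar" if probability >= threshold else "not similar"
    return label, probability


text_feature = [0.4714, 0.3077, 0]      # from the example in the description
example_semantic_feature = [0.1, -0.2]  # hypothetical values
weights = [1.0, 1.0, 2.0, 0.5, 0.5]     # hypothetical, untrained
bias = -1.5                             # hypothetical
label, probability = classify_similarity(text_feature, example_semantic_feature, weights, bias)
print(label)  # not similar
```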
As can be seen from the above technical solutions, the present application determines the text similarity, the similarity coefficient, and the polarity feature of the text to be detected and the target text. Because the polarity feature indicates whether the two texts have the same tone, the degree of similarity between them can be determined accurately. Determining the semantic feature avoids the loss of accuracy caused by words with the same form but different meanings, or different words with the same meaning. The text feature and the semantic feature together therefore allow the similarity result of the text to be detected and the target text to be determined accurately.
FIG. 4 is a functional module diagram of a preferred embodiment of the similar text determination apparatus of the present application. The similar text determination apparatus 11 includes a determination unit 110, an acquisition unit 111, a generation unit 112, and a conversion unit 113. A module/unit in the present application refers to a series of computer-readable instruction segments that can be fetched by the processor 13 and perform a fixed function, and that are stored in the memory 12. The functions of each module/unit are described in detail in the following embodiments.
The determination unit 110 receives a similar text determination request and determines the text to be detected according to the similar text determination request.
In at least one embodiment of the present application, the information carried in the similar text determination request includes, but is not limited to, the target text and a storage location. The similar text determination request can be triggered by any user.
The text to be detected is the text that is to be checked for similarity to the target text. There may be multiple texts to be detected.
In at least one embodiment of the present application, determining, by the determination unit 110, the text to be detected according to the similar text determination request includes:
parsing the message of the similar text determination request to obtain the data information carried in the message;
obtaining, from the data information, the information indicating a location as the storage location;
determining a library of texts to be detected from the storage location, and extracting any text from that library as the text to be detected.
Through the above implementation, because the entire similar text determination request does not need to be parsed, the storage location can be obtained more efficiently, and the text to be detected can in turn be obtained quickly.
The acquisition unit 111 obtains the target text from the similar text determination request.
In at least one embodiment of the present application, the target text is the reference text in the similar text determination request.
In at least one embodiment of the present application, obtaining, by the acquisition unit 111, the target text from the similar text determination request includes:
obtaining, from the data information, the information indicating text as the target text.
Through the above implementation, because the target text is stored in the similar text determination request, it can be obtained quickly from the parsed data information.
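The request handling performed by the determination unit and the acquisition unit can be sketched as below. The description does not specify the message format, so the JSON layout, the field names `storage_location` and `target_text`, and the in-memory dictionary standing in for the text library are all hypothetical.

```python
import json


def parse_request(message):
    """Extract the storage location and the target text from a request message."""
    data = json.loads(message)
    return data["storage_location"], data["target_text"]


def pick_text_to_detect(storage_location, libraries):
    """Return one text from the library found at the storage location."""
    library = libraries[storage_location]
    if not library:
        raise ValueError("text library is empty")
    return library[0]  # any entry may be chosen; the first one is used here


# Hypothetical library and request for illustration only.
libraries = {"lib/faq": ["我立即帮你申请好吗"]}
message = json.dumps({"storage_location": "lib/faq", "target_text": "我没办法帮您申请"})
location, target_text = parse_request(message)
text_to_detect = pick_text_to_detect(location, libraries)
print(location, target_text, text_to_detect)
```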
The generation unit 112 generates the feature vector to be detected according to the text to be detected and the target text, and generates the target feature vector according to the text to be detected and the target text.
In at least one embodiment of the present application, generating, by the generation unit 112, the feature vector to be detected according to the text to be detected and the target text includes:
performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments.
The word segments to be detected may be individual characters or words, and likewise the target word segments.
Taking the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all the word segments.
Generating the feature vector to be detected according to the mapping relationship between the plurality of word segments to be detected and all the word segments.
The mapping relationship indicates, for each of all the word segments, whether it appears among the plurality of word segments to be detected.
For example, the word segments to be detected are 我, 立, 即, 帮, 你, 申, 请, 好, 吗, and the target word segments are 我, 没, 办, 法, 帮, 您, 申, 请. All the word segments are therefore 我, 帮, 申, 请, 立, 即, 你, 好, 吗, 没, 办, 法, 您. Because 没, 办, 法, and 您 do not appear among the word segments to be detected, the feature vector to be detected is [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0].
Through the above implementation, the feature vector to be detected can be determined according to the text to be detected and the target text. Because the feature vector to be detected is generated with reference to the target text, it can be determined accurately.
In at least one embodiment of the present application, generating, by the generation unit 112, the target feature vector according to the text to be detected and the target text includes:
generating the target feature vector according to the mapping relationship between the plurality of target word segments and all the word segments.
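The construction of both feature vectors can be sketched as a presence mapping over the union of word segments. The ordering of the union (co-occurring segments first, then segments found only in the text to be detected, then segments found only in the target text) is an assumption chosen to reproduce the ordering in the example; the description does not prescribe one.

```python
def build_vocabulary(detected_tokens, target_tokens):
    """Union of both token lists: shared, then detected-only, then target-only."""
    detected_set = set(detected_tokens)
    target_set = set(target_tokens)
    shared = [t for t in detected_tokens if t in target_set]
    detected_only = [t for t in detected_tokens if t not in target_set]
    target_only = [t for t in target_tokens if t not in detected_set]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(shared + detected_only + target_only))


def feature_vector(tokens, vocabulary):
    """1 where a vocabulary entry appears among the tokens, 0 otherwise."""
    present = set(tokens)
    return [1 if token in present else 0 for token in vocabulary]


detected_tokens = ["我", "立", "即", "帮", "你", "申", "请", "好", "吗"]
target_tokens = ["我", "没", "办", "法", "帮", "您", "申", "请"]
vocabulary = build_vocabulary(detected_tokens, target_tokens)
print(feature_vector(detected_tokens, vocabulary))  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(feature_vector(target_tokens, vocabulary))    # [1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```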
The determination unit 110 calculates the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determines the similarity coefficient according to the text to be detected and the target text.
In at least one embodiment of the present application, the determination unit 110 calculates the similarity between the feature vector to be detected and the target feature vector using the cosine similarity formula.
The cosine similarity formula is as follows:
$$\cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\,\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
Here, $\cos\theta$ is the similarity between the feature vector to be detected and the target feature vector, $n$ is the dimension of both vectors, $i$ is the index of the current dimension, $x_i$ is the $i$-th component of the feature vector to be detected, and $y_i$ is the $i$-th component of the target feature vector.
The text similarity can be determined quickly with the cosine similarity formula.
In at least one embodiment of the present application, determining, by the determination unit 110, the similarity coefficient according to the text to be detected and the target text includes:
determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as the co-occurring words;
counting the co-occurring words to obtain the co-occurrence count, and counting all the word segments to obtain the total number of word segments;
dividing the co-occurrence count by the total number of word segments to obtain the similarity coefficient.
Continuing the above example, the co-occurring words are 我, 帮, 申, and 请, so the co-occurrence count is 4. The total number of word segments is 13, which gives a similarity coefficient of 4/13 ≈ 0.3077.
Through the above implementation, the similarity coefficient can be determined accurately from the words that co-occur in the text to be detected and the target text.
The determination unit 110 determines the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text.
In at least one embodiment of the present application, the polarity feature is either 1 or 0. When the tone of the text to be detected is the same as the tone of the target text, the polarity feature is set to 1; when the two tones differ, the polarity feature is set to 0.
In at least one embodiment of the present application, determining, by the determination unit 110, the polarity feature of the text to be detected and the target text according to their tones includes:
detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, where the preset word indicates a negative tone;
determining a first tone of the text to be detected according to the first detection result, and determining a second tone of the target text according to the second detection result;
if the first tone is the same as the second tone, setting the polarity feature to a first value; or
if the first tone differs from the second tone, setting the polarity feature to a second value.
The preset words include, but are not limited to: 无 ("none"), 没有 ("not have"), and 不 ("not").
Through the above implementation, the tones of the text to be detected and the target text can be determined accurately from the preset words, and the polarity feature can in turn be determined accurately.
The generation unit 112 generates the text feature of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature.
In at least one embodiment of the present application, the text feature is obtained by concatenating the text similarity, the similarity coefficient, and the polarity feature.
For example, if the text similarity is 0.4714, the similarity coefficient is 0.3077, and the polarity feature is 0, concatenation yields the text feature [0.4714, 0.3077, 0].
The conversion unit 113 converts the text to be detected into a semantic vector to be detected, and converts the target text into a target semantic vector.
In at least one embodiment of the present application, the semantic vector to be detected carries the semantics of the text to be detected, and the target semantic vector carries the semantics of the target text.
In at least one embodiment of the present application, converting, by the conversion unit 113, the text to be detected into the semantic vector to be detected includes:
converting the text to be detected into a sequence of character vectors;
performing feature extraction on the character vector sequence with a forward long short-term memory (LSTM) network to obtain a first feature vector;
performing feature extraction on the character vector sequence with a backward LSTM network to obtain a second feature vector;
concatenating the first feature vector and the second feature vector to obtain the semantic vector to be detected.
Through the above implementation, the generated semantic vector to be detected captures the context of the text to be detected in both directions, which improves the accuracy of the semantic vector.
The determination unit 110 generates the semantic feature of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determines the similarity result of the text to be detected and the target text according to the text feature and the semantic feature.
It should be emphasized that, to further ensure the privacy and security of the similarity result, the similarity result may also be stored in a node of a blockchain.
In at least one embodiment of the present application, the similarity result is either that the text to be detected is similar to the target text or that the text to be detected is not similar to the target text.
In at least one embodiment of the present application, generating, by the determination unit 110, the semantic feature of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector includes:
subtracting the target semantic vector from the semantic vector to be detected to obtain a difference vector;
concatenating the semantic vector to be detected, the target semantic vector, and the difference vector to obtain a concatenated semantic vector;
mapping the concatenated semantic vector iteratively through pre-built hidden layers to obtain the semantic feature.
Through the above implementation, because the semantic feature is computed from both the semantic vector to be detected and the target semantic vector, it carries the semantics of both the text to be detected and the target text, which improves the accuracy of the semantic feature.
In at least one embodiment of the present application, determining, by the determination unit 110, the similarity result of the text to be detected and the target text according to the text feature and the semantic feature includes:
concatenating the text feature and the semantic feature to obtain a target vector;
inputting the target vector into a pre-built binary classification network to obtain the similarity result.
Through the above implementation, because the similarity result is determined from both the text feature and the semantic feature, it can be determined accurately.
As can be seen from the above technical solutions, the present application determines the text similarity, the similarity coefficient, and the polarity feature of the text to be detected and the target text. Because the polarity feature indicates whether the two texts have the same tone, the degree of similarity between them can be determined accurately. Determining the semantic feature avoids the loss of accuracy caused by words with the same form but different meanings, or different words with the same meaning. The text feature and the semantic feature together therefore allow the similarity result of the text to be detected and the target text to be determined accurately.
FIG. 5 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present application that implements the similar text determination method.
In one embodiment of the present application, the electronic device 1 includes, but is not limited to, a memory 12, a processor 13, and computer-readable instructions, such as a similar text determination program, that are stored in the memory 12 and executable on the processor 13.
Those skilled in the art will understand that the schematic diagram is only an example of the electronic device 1 and does not limit it. The electronic device 1 may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, buses, and the like.
The processor 13 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor 13 is the computing core and control center of the electronic device 1; it connects the parts of the electronic device 1 through various interfaces and lines, and runs the operating system of the electronic device 1 as well as the installed applications and program code.
Exemplarily, the computer-readable instructions may be divided into one or more modules/units, which are stored in the memory 12 and executed by the processor 13 to carry out the present application. The one or more modules/units may be a series of computer-readable instruction segments that perform specific functions, and these instruction segments describe how the computer-readable instructions are executed in the electronic device 1. For example, the computer-readable instructions may be divided into a determination unit 110, an acquisition unit 111, a generation unit 112, and a conversion unit 113.
The memory 12 may be used to store the computer-readable instructions and/or modules. The processor 13 implements the various functions of the electronic device 1 by running or executing the computer-readable instructions and/or modules stored in the memory 12 and by calling the data stored in the memory 12. The memory 12 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function or an image playback function), and the data storage area may store data created through use of the electronic device. The memory 12 may include non-volatile and volatile memory, for example a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another storage device.
The memory 12 may be an external memory and/or an internal memory of the electronic device 1. Further, the memory 12 may be a memory in physical form, such as a memory module or a TF card (Trans-flash Card).
If the modules/units integrated in the electronic device 1 are implemented as software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. On this understanding, all or part of the processes in the methods of the above embodiments may also be completed by computer-readable instructions that instruct the relevant hardware. The computer-readable instructions may be stored in a computer-readable storage medium and, when executed by a processor, implement the steps of the above method embodiments.
The computer-readable instructions include computer-readable instruction code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer-readable instruction code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), or random access memory (RAM).
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, and an application service layer.
With reference to FIG. 1, the memory 12 in the electronic device 1 stores computer-readable instructions for implementing a similar text determination method, and the processor 13 can execute the computer-readable instructions to implement:
receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;
obtaining the target text from the similar text determination request;
generating a feature vector to be detected according to the text to be detected and the target text, and generating a target feature vector according to the text to be detected and the target text;
calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining a similarity coefficient according to the text to be detected and the target text;
determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result for the text to be detected and the target text according to the text features and the semantic features.
Specifically, for how the processor 13 implements the above computer-readable instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which is not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation.
The computer-readable storage medium stores computer-readable instructions which, when executed by the processor 13, implement the following steps:
receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;
obtaining the target text from the similar text determination request;
generating a feature vector to be detected according to the text to be detected and the target text, and generating a target feature vector according to the text to be detected and the target text;
calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining a similarity coefficient according to the text to be detected and the target text;
determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result for the text to be detected and the target text according to the text features and the semantic features.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
Therefore, the embodiments are to be regarded in all respects as exemplary and non-restrictive. The scope of this application is defined by the appended claims rather than by the foregoing description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced in this application. No reference sign in the claims shall be construed as limiting the claim concerned.
Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices recited may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used as names and do not denote any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of this application may be modified or equivalently substituted without departing from their spirit and scope.

Claims (20)

  1. A similar text determination method, wherein the similar text determination method comprises:
    receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;
    obtaining the target text from the similar text determination request;
    performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments;
    obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
    generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generating a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
    calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
    calculating the co-occurrence count of the co-occurring words, and calculating the total number of word segments in all the word segments;
    dividing the co-occurrence count by the total number of word segments to obtain a similarity coefficient;
    determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
    generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
    converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
    generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result for the text to be detected and the target text according to the text features and the semantic features.
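As an illustration of the similarity coefficient recited in claim 1, the following minimal Python sketch divides the size of the intersection of the two sets of word segments by the size of their union, which matches what is commonly called the Jaccard index. The function name, the pre-tokenized English word lists, and the handling of empty input are assumptions made for the example; the claim does not prescribe a tokenizer or an implementation.

```python
def similarity_coefficient(detected_tokens, target_tokens):
    """Co-occurrence count divided by the total number of distinct word segments."""
    detected, target = set(detected_tokens), set(target_tokens)
    all_tokens = detected | target      # union: "all word segments"
    co_occurring = detected & target    # intersection: "co-occurring words"
    if not all_tokens:
        # Assumption: two empty texts yield a coefficient of 0.0.
        return 0.0
    return len(co_occurring) / len(all_tokens)


# Hypothetical, already segmented texts.
detected = ["how", "do", "i", "repay", "the", "loan"]
target = ["how", "to", "repay", "a", "loan"]
coefficient = similarity_coefficient(detected, target)
print(coefficient)
```

Here three words co-occur ("how", "repay", "loan") and the union holds eight distinct words, so the coefficient is 3/8.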
  2. The similar text determination method according to claim 1, wherein determining the text to be detected according to the similar text determination request comprises:
    parsing the message of the similar text determination request to obtain the data information carried by the message;
    obtaining, from the data information, the information indicating a location as a storage location;
    determining a library of texts to be detected from the storage location, and extracting any text from the library of texts to be detected as the text to be detected.
  3. The similar text determination method according to claim 1, wherein generating the feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments comprises:
    generating the feature vector to be detected according to the mapping relationship between the plurality of word segments to be detected and all the word segments.
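One way to read the mapping in claim 3 is as a presence vector over the union of word segments, which can then be compared with the target vector. The sketch below is written under that assumption: it marks each position with 1 if the word segment occurs in the text and 0 otherwise, and it uses cosine similarity as the vector similarity. Both the binary encoding and the choice of cosine similarity are assumptions, since the claims specify neither, and the helper names are hypothetical.

```python
import math


def build_feature_vector(tokens, all_tokens):
    """Map a text's word segments onto the ordered list of all word segments."""
    present = set(tokens)
    return [1 if token in present else 0 for token in all_tokens]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already segmented texts.
detected = ["how", "do", "i", "repay", "the", "loan"]
target = ["how", "to", "repay", "a", "loan"]

# Sorting the union only fixes a reproducible position for each word segment.
all_tokens = sorted(set(detected) | set(target))
detected_vector = build_feature_vector(detected, all_tokens)
target_vector = build_feature_vector(target, all_tokens)
text_similarity = cosine_similarity(detected_vector, target_vector)
print(detected_vector, target_vector, round(text_similarity, 4))
```

A count-based or TF-IDF-weighted vector could be substituted for the binary one without changing the rest of the sketch.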
  4. The similar text determination method according to claim 1, wherein determining the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text comprises:
    detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, the preset word being used to indicate a negative tone;
    determining a first tone of the text to be detected according to the first detection result, and determining a second tone of the target text according to the second detection result;
    if the first tone is the same as the second tone, determining the polarity feature to be a first value; or
    if the first tone is different from the second tone, determining the polarity feature to be a second value.
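The polarity check of claim 4 can be sketched in a few lines of Python. The list of preset negation words and the concrete first and second values (1 and 0 here) are assumptions chosen for illustration; the claim only requires that the two values be distinguishable.

```python
# Hypothetical preset words indicating a negative tone.
NEGATION_WORDS = ("not", "no", "never", "cannot")


def has_negative_tone(tokens, negation_words=NEGATION_WORDS):
    """Detection result: True if the text contains any preset negation word."""
    return any(token in negation_words for token in tokens)


def polarity_feature(detected_tokens, target_tokens, same_value=1, different_value=0):
    """Return same_value when both texts share a tone, otherwise different_value."""
    first_tone = has_negative_tone(detected_tokens)
    second_tone = has_negative_tone(target_tokens)
    return same_value if first_tone == second_tone else different_value


print(polarity_feature(["i", "can", "repay"], ["i", "cannot", "repay"]))  # tones differ
print(polarity_feature(["i", "can", "repay"], ["i", "will", "repay"]))    # tones match
```

A feature of this kind lets the final decision separate pairs such as "can repay" and "cannot repay", which overlap heavily in vocabulary but have opposite meanings.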
  5. The similar text determination method according to claim 1, wherein converting the text to be detected into the semantic vector to be detected comprises:
    converting the text to be detected into a sequence of character vectors;
    performing feature extraction on the sequence of character vectors using a forward long short-term memory network to obtain a first feature vector;
    performing feature extraction on the sequence of character vectors using a backward long short-term memory network to obtain a second feature vector;
    concatenating the first feature vector and the second feature vector to obtain the semantic vector to be detected.
  6. The similar text determination method according to claim 1, wherein generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector comprises:
    subtracting the target semantic vector from the semantic vector to be detected to obtain a difference vector;
    concatenating the semantic vector to be detected, the target semantic vector, and the difference vector to obtain a concatenated semantic vector;
    iteratively mapping the concatenated semantic vector using pre-built multiple hidden layers to obtain the semantic features.
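The three steps of claim 6 can be illustrated with a small, dependency-free Python sketch: subtract, concatenate, then pass the result through a stack of fully connected layers. The tanh activation, the layer sizes, and the randomly initialized weights are assumptions standing in for the pre-built (presumably trained) hidden layers, and the input vectors are hypothetical three-dimensional examples.

```python
import math
import random


def difference(u, v):
    """Element-wise u - v."""
    return [a - b for a, b in zip(u, v)]


def dense(vector, weights, bias):
    """One fully connected layer with a tanh activation (activation is an assumption)."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def semantic_feature(detected_vec, target_vec, layers):
    """Concatenate [detected, target, detected - target] and map through each layer in turn."""
    hidden = detected_vec + target_vec + difference(detected_vec, target_vec)
    for weights, bias in layers:
        hidden = dense(hidden, weights, bias)
    return hidden


def random_layer(rng, n_in, n_out):
    """Placeholder for a trained layer: small random weights and zero bias."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
detected_vec = [0.2, -0.1, 0.4]   # hypothetical semantic vector to be detected
target_vec = [0.1, 0.3, 0.4]      # hypothetical target semantic vector
layers = [random_layer(rng, 9, 4), random_layer(rng, 4, 2)]
feature = semantic_feature(detected_vec, target_vec, layers)
print(feature)
```

Including the difference vector alongside the two original vectors gives the hidden layers direct access to where the two texts diverge, in addition to each text's own representation.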
  7. A similar text determination apparatus, wherein the similar text determination apparatus comprises:
    a determining unit, configured to receive a similar text determination request and determine the text to be detected according to the similar text determination request;
    an obtaining unit, configured to obtain the target text from the similar text determination request;
    a generating unit, configured to perform word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and to perform word segmentation on the target text to obtain a plurality of target word segments;
    the generating unit being further configured to obtain the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
    the generating unit being further configured to generate a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and to generate a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
    the determining unit being further configured to calculate the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and to determine the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
    the determining unit being further configured to calculate the co-occurrence count of the co-occurring words, and to calculate the total number of word segments in all the word segments;
    the determining unit being further configured to divide the co-occurrence count by the total number of word segments to obtain a similarity coefficient;
    the determining unit being further configured to determine a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
    the generating unit being further configured to generate text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
    a converting unit, configured to convert the text to be detected into a semantic vector to be detected, and to convert the target text into a target semantic vector;
    the determining unit being further configured to generate semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and to determine a similarity result for the text to be detected and the target text according to the text features and the semantic features.
  8. The similar text determination apparatus according to claim 7, wherein the converting unit converting the text to be detected into the semantic vector to be detected comprises:
    converting the text to be detected into a sequence of character vectors;
    performing feature extraction on the sequence of character vectors using a forward long short-term memory network to obtain a first feature vector;
    performing feature extraction on the sequence of character vectors using a backward long short-term memory network to obtain a second feature vector;
    concatenating the first feature vector and the second feature vector to obtain the semantic vector to be detected.
  9. An electronic device, wherein the electronic device comprises a processor and a memory, the processor being configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:
    receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;
    obtaining the target text from the similar text determination request;
    performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments;
    obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
    generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generating a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
    calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
    calculating the co-occurrence count of the co-occurring words, and calculating the total number of word segments in all the word segments;
    dividing the co-occurrence count by the total number of word segments to obtain a similarity coefficient;
    determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
    generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
    converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
    generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result for the text to be detected and the target text according to the text features and the semantic features.
  10. The electronic device according to claim 9, wherein, when determining the text to be detected according to the similar text determination request, the processor executes the at least one computer-readable instruction to implement the following steps:
    parsing the message of the similar text determination request to obtain the data information carried by the message;
    obtaining, from the data information, the information indicating a location as a storage location;
    determining a library of texts to be detected from the storage location, and extracting any text from the library of texts to be detected as the text to be detected.
  11. The electronic device according to claim 9, wherein, when generating the feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, the processor executes the at least one computer-readable instruction to implement the following step:
    generating the feature vector to be detected according to the mapping relationship between the plurality of word segments to be detected and all the word segments.
  12. The electronic device according to claim 9, wherein, when determining the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text, the processor executes the at least one computer-readable instruction to implement the following steps:
    detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, the preset word being used to indicate a negative tone;
    determining a first tone of the text to be detected according to the first detection result, and determining a second tone of the target text according to the second detection result;
    if the first tone is the same as the second tone, determining the polarity feature to be a first value; or
    if the first tone is different from the second tone, determining the polarity feature to be a second value.
  13. The electronic device according to claim 9, wherein, when converting the text to be detected into the semantic vector to be detected, the processor executes the at least one computer-readable instruction to implement the following steps:
    converting the text to be detected into a sequence of character vectors;
    performing feature extraction on the sequence of character vectors using a forward long short-term memory network to obtain a first feature vector;
    performing feature extraction on the sequence of character vectors using a backward long short-term memory network to obtain a second feature vector;
    concatenating the first feature vector and the second feature vector to obtain the semantic vector to be detected.
  14. The electronic device according to claim 9, wherein, when generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, the processor executes the at least one computer-readable instruction to implement the following steps:
    subtracting the target semantic vector from the semantic vector to be detected to obtain a difference vector;
    concatenating the semantic vector to be detected, the target semantic vector, and the difference vector to obtain a concatenated semantic vector;
    iteratively mapping the concatenated semantic vector using pre-built multiple hidden layers to obtain the semantic features.
  15. A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction which, when executed by a processor, implements the following steps:
    receiving a similar text determination request, and determining the text to be detected according to the similar text determination request;
    obtaining the target text from the similar text determination request;
    performing word segmentation on the text to be detected to obtain a plurality of word segments to be detected, and performing word segmentation on the target text to obtain a plurality of target word segments;
    obtaining the union of the plurality of word segments to be detected and the plurality of target word segments to obtain all word segments;
    generating a feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, and generating a target feature vector according to the plurality of word segments to be detected and the plurality of target word segments;
    calculating the similarity between the feature vector to be detected and the target feature vector to obtain the text similarity between the text to be detected and the target text, and determining the intersection of the plurality of word segments to be detected and the plurality of target word segments as co-occurring words;
    calculating the co-occurrence count of the co-occurring words, and calculating the total number of word segments in all the word segments;
    dividing the co-occurrence count by the total number of word segments to obtain a similarity coefficient;
    determining a polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text;
    generating text features of the text to be detected and the target text according to the text similarity, the similarity coefficient, and the polarity feature;
    converting the text to be detected into a semantic vector to be detected, and converting the target text into a target semantic vector;
    generating semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, and determining a similarity result for the text to be detected and the target text according to the text features and the semantic features.
  16. The storage medium according to claim 15, wherein, when determining the text to be detected according to the similar text determination request, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    parsing the message of the similar text determination request to obtain the data information carried by the message;
    obtaining, from the data information, the information indicating a location as a storage location;
    determining a library of texts to be detected from the storage location, and extracting any text from the library of texts to be detected as the text to be detected.
  17. The storage medium of claim 15, wherein, when generating the feature vector to be detected according to the plurality of word segments to be detected and the plurality of target word segments, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    generating the feature vector to be detected according to a mapping relationship between the plurality of word segments to be detected and all of the word segments.
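One plausible reading of the mapping in claim 17 is a bag-of-words vector over a shared vocabulary. The sketch below assumes that "all of the word segments" means the deduplicated union of both segment lists and that the mapping records how often each vocabulary entry occurs; the claim text shown here confirms neither detail, and the function names are hypothetical.

```python
def build_vocabulary(detected_tokens, target_tokens):
    """Deduplicated union of both token lists, in first-seen order."""
    vocab = []
    seen = set()
    for tok in list(detected_tokens) + list(target_tokens):
        if tok not in seen:
            seen.add(tok)
            vocab.append(tok)
    return vocab

def feature_vector(tokens, vocab):
    """Map tokens onto the vocabulary: one occurrence count per entry."""
    return [tokens.count(word) for word in vocab]

detected = ["reset", "my", "password"]
target = ["change", "my", "password"]
vocab = build_vocabulary(detected, target)
detected_vec = feature_vector(detected, vocab)  # [1, 1, 1, 0]
target_vec = feature_vector(target, vocab)      # [0, 1, 1, 1]
```

Because both vectors are built over the same vocabulary, they have equal length and can be compared directly, for example with cosine similarity.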
  18. The storage medium of claim 15, wherein, when determining the polarity feature of the text to be detected and the target text according to the tone of the text to be detected and the tone of the target text, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    detecting whether the text to be detected contains a preset word to obtain a first detection result, and detecting whether the target text contains the preset word to obtain a second detection result, the preset word being used to indicate a negative tone;
    determining a first tone of the text to be detected according to the first detection result, and determining a second tone of the target text according to the second detection result;
    if the first tone is the same as the second tone, determining the polarity feature as a first value; or
    if the first tone is different from the second tone, determining the polarity feature as a second value.
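The polarity check of claim 18 can be sketched as below. The preset word list, the whitespace tokenization, and the concrete first and second values (1 and 0) are assumptions made for illustration; the claim does not fix any of them.

```python
NEGATION_WORDS = ("not", "no", "never", "cannot")  # hypothetical preset words

def has_negative_tone(text, preset_words=NEGATION_WORDS):
    """Detection result: True if the text contains any preset negation word."""
    tokens = text.lower().split()
    return any(word in tokens for word in preset_words)

def polarity_feature(text_to_detect, target_text, same_value=1, different_value=0):
    """Return same_value when both tones match, different_value otherwise."""
    first_tone = has_negative_tone(text_to_detect)
    second_tone = has_negative_tone(target_text)
    return same_value if first_tone == second_tone else different_value

mismatch = polarity_feature("I can not log in", "I can log in")  # tones differ
match = polarity_feature("I can log in", "I can sign in")        # tones agree
```

The feature lets a downstream classifier separate pairs such as "can log in" and "can not log in", which overlap heavily at the word level yet mean the opposite.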
  19. The storage medium of claim 15, wherein, when converting the text to be detected into the semantic vector to be detected, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    converting the text to be detected into a character vector sequence;
    performing feature extraction on the character vector sequence by using a forward long short-term memory network to obtain a first feature vector;
    performing feature extraction on the character vector sequence by using a backward long short-term memory network to obtain a second feature vector;
    concatenating the first feature vector and the second feature vector to obtain the semantic vector to be detected.
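Claim 19 describes a bidirectional LSTM encoder. The following is a minimal pure-Python sketch rather than the applicant's implementation: the weights are random and untrained, the input vectors are assumed to be given, and taking the final hidden state of each direction as its feature vector is an assumption. All function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_lstm(input_size, hidden_size, rng):
    """Random weights and zero biases for the input, forget, cell and output gates."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {gate: (matrix(hidden_size, input_size + hidden_size), [0.0] * hidden_size)
            for gate in ("i", "f", "g", "o")}

def affine(weights, bias, vec):
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def run_lstm(params, sequence, hidden_size):
    """Run one LSTM over the sequence and return its final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        xh = list(x) + h  # current input joined with previous hidden state
        i = [sigmoid(v) for v in affine(*params["i"], xh)]
        f = [sigmoid(v) for v in affine(*params["f"], xh)]
        g = [math.tanh(v) for v in affine(*params["g"], xh)]
        o = [sigmoid(v) for v in affine(*params["o"], xh)]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h

def semantic_vector(vectors, forward, backward, hidden_size):
    first = run_lstm(forward, vectors, hidden_size)         # left-to-right pass
    second = run_lstm(backward, vectors[::-1], hidden_size)  # right-to-left pass
    return first + second                                    # concatenation

rng = random.Random(0)
forward, backward = init_lstm(3, 4, rng), init_lstm(3, 4, rng)
vec = semantic_vector([[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]], forward, backward, 4)
```

With a hidden size of 4 per direction, the concatenated semantic vector has 8 components. In practice a trained network from a deep learning framework would replace these hand-written cells.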
  20. The storage medium of claim 15, wherein, when generating the semantic features of the text to be detected and the target text according to the semantic vector to be detected and the target semantic vector, the at least one computer-readable instruction is executed by the processor to implement the following steps:
    subtracting the target semantic vector from the semantic vector to be detected to obtain a difference vector;
    concatenating the semantic vector to be detected, the target semantic vector, and the difference vector to obtain a concatenated semantic vector;
    iteratively mapping the concatenated semantic vector through a plurality of pre-built hidden layers to obtain the semantic features.
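The three steps of claim 20 can be sketched as follows. The layer sizes, the tanh activation, and the random untrained weights are assumptions for illustration only; the claim states merely that pre-built hidden layers map the concatenated vector. All function names are hypothetical.

```python
import math
import random

def difference(a, b):
    """Element-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def make_layer(in_size, out_size, rng):
    """Hypothetical hidden layer: random weights and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_size)] for _ in range(out_size)]
    return weights, [0.0] * out_size

def dense(weights, bias, vec):
    """Affine map followed by tanh (the activation is an assumption)."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

def semantic_feature(detected_vec, target_vec, layers):
    diff = difference(detected_vec, target_vec)          # difference vector
    out = list(detected_vec) + list(target_vec) + diff   # concatenated vector
    for weights, bias in layers:                         # iterate over hidden layers
        out = dense(weights, bias, out)
    return out

rng = random.Random(0)
layers = [make_layer(6, 4, rng), make_layer(4, 2, rng)]
feature = semantic_feature([0.2, 0.4], [0.1, 0.9], layers)
```

Two input vectors of length 2 give a concatenated vector of length 6, which the two layers reduce to a semantic feature of length 2. Keeping the difference vector alongside both originals gives the layers direct access to where the two texts diverge.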
PCT/CN2021/109391 2021-01-19 2021-07-29 Similar text determination method and related device WO2022156180A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110071000.0 2021-01-19
CN202110071000.0A CN112395886B (en) 2021-01-19 2021-01-19 Similar text determination method and related equipment

Publications (1)

Publication Number Publication Date
WO2022156180A1 true WO2022156180A1 (en) 2022-07-28

Family

ID=74625659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109391 WO2022156180A1 (en) 2021-01-19 2021-07-29 Similar text determination method and related device

Country Status (2)

Country Link
CN (1) CN112395886B (en)
WO (1) WO2022156180A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117195860A (en) * 2023-11-07 2023-12-08 品茗科技股份有限公司 Intelligent inspection method, system, electronic equipment and computer readable storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112395886B (en) * 2021-01-19 2021-04-13 深圳壹账通智能科技有限公司 Similar text determination method and related equipment
CN113239666B (en) * 2021-05-13 2023-09-29 深圳市智灵时代科技有限公司 Text similarity calculation method and system
CN113987115A (en) * 2021-09-26 2022-01-28 润联智慧科技(西安)有限公司 Text similarity calculation method, device, equipment and storage medium
CN116957368A (en) * 2022-03-31 2023-10-27 华为技术有限公司 Scoring method and related device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102880600A (en) * 2012-08-30 2013-01-16 北京航空航天大学 Word semantic tendency prediction method based on universal knowledge network
US20140249799A1 (en) * 2013-03-04 2014-09-04 Microsoft Corporation Relational similarity measurement
CN109145299A (en) * 2018-08-16 2019-01-04 北京金山安全软件有限公司 Text similarity determination method, device, equipment and storage medium
CN109635077A (en) * 2018-12-18 2019-04-16 武汉斗鱼网络科技有限公司 Calculation method, device, electronic equipment and the storage medium of text similarity
CN112395886A (en) * 2021-01-19 2021-02-23 深圳壹账通智能科技有限公司 Similar text determination method and related equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070136839A1 (en) * 2003-10-14 2007-06-14 Ceres, Inc. Promoter, promoter control elements, and combinations, and uses thereof
CN108090047B (en) * 2018-01-10 2022-05-24 华南师范大学 Text similarity determination method and equipment
CN108052509B (en) * 2018-01-31 2019-06-28 北京神州泰岳软件股份有限公司 A kind of Text similarity computing method, apparatus and server
CN108595517B (en) * 2018-03-26 2021-03-09 南京邮电大学 Large-scale document similarity detection method
CN108874174B (en) * 2018-05-29 2020-04-24 腾讯科技(深圳)有限公司 Text error correction method and device and related equipment
CN110852056A (en) * 2018-07-25 2020-02-28 中兴通讯股份有限公司 Method, device and equipment for acquiring text similarity and readable storage medium
CN110781277A (en) * 2019-09-23 2020-02-11 厦门快商通科技股份有限公司 Text recognition model similarity training method, system, recognition method and terminal
CN111949766A (en) * 2020-08-20 2020-11-17 深圳市卡牛科技有限公司 Text similarity recognition method, system, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117195860A (en) * 2023-11-07 2023-12-08 品茗科技股份有限公司 Intelligent inspection method, system, electronic equipment and computer readable storage medium
CN117195860B (en) * 2023-11-07 2024-03-26 品茗科技股份有限公司 Intelligent inspection method, system, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN112395886B (en) 2021-04-13
CN112395886A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2022156180A1 (en) Similar text determination method and related device
WO2021114736A1 (en) Medical consultation assistance method and apparatus, electronic device, and medium
CN108292310B (en) Techniques for digital entity correlation
WO2022105122A1 (en) Answer generation method and apparatus based on artificial intelligence, and computer device and medium
WO2022105115A1 (en) Question and answer pair matching method and apparatus, electronic device and storage medium
US11901047B2 (en) Medical visual question answering
CN110134965B (en) Method, apparatus, device and computer readable storage medium for information processing
WO2022088671A1 (en) Automated question answering method and apparatus, device, and storage medium
WO2021120688A1 (en) Medical misdiagnosis detection method and apparatus, electronic device and storage medium
WO2021196825A1 (en) Abstract generation method and apparatus, and electronic device and medium
US11010566B2 (en) Inferring confidence and need for natural language processing of input data
WO2022160442A1 (en) Answer generation method and apparatus, electronic device, and readable storage medium
WO2023045184A1 (en) Text category recognition method and apparatus, computer device, and medium
WO2022073513A1 (en) Information input assistance method and apparatus, electronic device and storage medium
CN113268597B (en) Text classification method, device, equipment and storage medium
WO2024087297A1 (en) Text sentiment analysis method and apparatus, electronic device, and storage medium
CN113486680B (en) Text translation method, device, equipment and storage medium
CN113420545B (en) Abstract generation method, device, equipment and storage medium
CN113627186B (en) Entity relation detection method based on artificial intelligence and related equipment
TWI777319B (en) Method and device for determining stem cell density, computer device and storage medium
US20220027612A1 (en) Detecting and processing sections spanning processed document partitions
US11663215B2 (en) Selectively targeting content section for cognitive analytics and search
US11386056B2 (en) Duplicate multimedia entity identification and processing
US10971273B2 (en) Identification of co-located artifacts in cognitively analyzed corpora
US11055491B2 (en) Geographic location specific models for information extraction and knowledge discovery

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21920561

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 031123)

122 Ep: pct application non-entry in european phase

Ref document number: 21920561

Country of ref document: EP

Kind code of ref document: A1