CN118035765A - Text similarity matching method and device, storage medium and electronic equipment - Google Patents

Text similarity matching method and device, storage medium and electronic equipment

Info

Publication number
CN118035765A
Authority
CN
China
Prior art keywords
word frequency
text
vector
word
vocabulary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410242368.2A
Other languages
Chinese (zh)
Inventor
张鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aijiwei Consulting Xiamen Co ltd
Original Assignee
Aijiwei Consulting Xiamen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aijiwei Consulting Xiamen Co ltd
Priority to CN202410242368.2A
Publication of CN118035765A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text similarity matching method and device, a storage medium and electronic equipment. The text similarity matching method comprises: obtaining a first text and a second text to be matched; preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock; vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector; performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum; and calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum. The scheme can improve the efficiency of text similarity matching.

Description

Text similarity matching method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of data processing, in particular to a text similarity matching method and device, a storage medium and electronic equipment.
Background
With the rapid development of the Internet, text similarity matching is increasingly widely used in the computer field. Its core aim is to determine the similarity between two or more texts, thereby supporting tasks such as automatic text comparison, matching and summarization. Text similarity calculation plays an important role in many fields, such as machine translation, question-answering systems and web page search.
However, current text similarity matching methods take a long time to match long texts such as articles, which results in a poor user experience.
Disclosure of Invention
The embodiment of the application provides a text similarity matching method, a device, a storage medium and electronic equipment, which can improve the efficiency of text similarity matching.
In a first aspect, an embodiment of the present application provides a text similarity matching method, including:
Acquiring a first text and a second text to be matched;
Preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock;
vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector;
Performing vector difference value processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference value vector and a target word frequency sum;
and calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum.
In the text similarity matching method provided by the embodiment of the present application, the performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum includes:
Obtaining a difference vector of the first word frequency vector and the second word frequency vector;
Respectively calculating word frequency sums of the first word frequency vector and the second word frequency vector to obtain a first word frequency sum and a second word frequency sum;
and comparing the first word frequency sum with the second word frequency sum to determine a target word frequency sum.
In the text similarity matching method provided by the embodiment of the application, the comparing the first word frequency sum with the second word frequency sum to determine the target word frequency sum includes:
Comparing the first word frequency sum with the second word frequency sum;
when the first word frequency sum is larger than the second word frequency sum, taking the first word frequency sum as a target word frequency sum;
And when the first word frequency sum is smaller than the second word frequency sum, taking the second word frequency sum as a target word frequency sum.
In the text similarity matching method provided by the embodiment of the present application, the vectorizing processing is performed on the first vocabulary and the second vocabulary by the word stock to generate a first word frequency vector and a second word frequency vector, which includes:
traversing the first vocabulary to determine the word frequency of each word in the first vocabulary in the word stock to form a first word frequency vector;
traversing the second vocabulary to determine the word frequency of each word in the second vocabulary in the word stock to form a second word frequency vector.
In the text similarity matching method provided by the embodiment of the application, the preprocessing is performed on the first text and the second text to form a first vocabulary, a second vocabulary and a word stock, and the method comprises the following steps:
Performing word segmentation on the first text and the second text respectively to form a first vocabulary and a second vocabulary;
Generating a word stock based on the first vocabulary and the second vocabulary.
In the text similarity matching method provided by the embodiment of the application, the generating a word stock based on the first vocabulary and the second vocabulary includes:
Merging and de-duplicating the first vocabulary and the second vocabulary to generate a word stock.
In the text similarity matching method provided by the embodiment of the present application, the calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum includes:
Calculating the quotient of the difference vector and the target word frequency sum;
And taking the quotient of the difference vector and the target word frequency sum as the text similarity of the first text and the second text.
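The quotient described above can be written as a single formula. This is a sketch under stated assumptions: the symbols C_A, C_B and C_max follow the naming used in the detailed description, while n (the number of words in the word stock), the index i and the name S are introduced here only for illustration.

```latex
S(A,B) \;=\; \frac{\sum_{i=1}^{n} \left| C_A[i] - C_B[i] \right|}{C_{\max}},
\qquad
C_{\max} \;=\; \max\!\left( \sum_{i=1}^{n} C_A[i],\; \sum_{i=1}^{n} C_B[i] \right)
```

Read this way, two texts with identical word frequency vectors yield a value of 0, and the value grows as more word frequencies differ.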
In a second aspect, an embodiment of the present application provides a text similarity matching apparatus, including:
the text acquisition unit is used for acquiring a first text and a second text to be matched;
the text processing unit is used for preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock;
the vocabulary processing unit is used for carrying out vectorization processing on the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector;
the vector processing unit is used for carrying out vector difference value processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector, and generating a difference value vector and a target word frequency sum;
and the text matching unit is used for calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum.
In a third aspect, the present application provides a storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the text similarity matching method of any one of the above.
In a fourth aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements any one of the above text similarity matching methods when executing the computer program.
In summary, the text similarity matching method provided by the embodiment of the application includes obtaining a first text and a second text to be matched; preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock; vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector; performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum; and calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum. Because the scheme vectorizes the first vocabulary and the second vocabulary, the similarity of the first text and the second text can be calculated rapidly and in parallel, which improves the efficiency of text similarity matching.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a text similarity matching system according to an embodiment of the present application.
Fig. 2 is a flow chart of a text similarity matching method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a text similarity matching device according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Furthermore, elements having the same name in different embodiments of the application may have the same meaning or may have different meanings, the particular meaning being determined by its interpretation in the particular embodiment or by further combining the context of the particular embodiment.
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In the following description, suffixes such as "module", "part" or "unit" for representing elements are used only for facilitating the description of the present application, and have no specific meaning per se. Thus, "module," "component," or "unit" may be used in combination.
In the description of the present application, it should be noted that the orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "inner", "outer", etc. is based on the orientation or positional relationship shown in the drawings, is merely for convenience of describing the present application and simplifying the description, and does not indicate or imply that the apparatus or element in question must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Existing text similarity matching methods take a long time to match long texts such as articles, which results in a poor user experience.
Based on the above, the embodiment of the application provides a text similarity matching method, a device, a storage medium and an electronic device. Specifically, the text similarity matching device can be integrated in the electronic device, and the electronic device can be a server, a terminal or other equipment. The terminal can include a mobile phone, a wearable smart device, a tablet computer, a notebook computer, a personal computer (Personal Computer, PC) and the like. The server can be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), big data and artificial intelligence platforms.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a text similarity matching system according to an embodiment of the present application. The system may include at least one electronic device 1000, at least one server or personal computer 2000. The electronic device 1000 held by the user may be connected to different servers or personal computers through a network. The electronic device 1000 may be an electronic device having computing hardware capable of supporting and executing software products corresponding to multimedia. In addition, the electronic device 1000 may also have one or more multi-touch sensitive screens for sensing and obtaining input from a user through touch or slide operations performed at multiple points of the one or more touch sensitive display screens. In addition, the electronic device 1000 may be connected to a server or a personal computer 2000 through a network. The network may be a wireless network or a wired network, such as a Wireless Local Area Network (WLAN), a Local Area Network (LAN), a cellular network, a 2G network, a 3G network, a 4G network, a 5G network, etc. In addition, the different electronic devices 1000 may be connected to other embedded platforms or to a server, a personal computer, or the like using their own bluetooth network or hotspot network.
The electronic equipment comprises a touch display screen and a processor, wherein the touch display screen is used for presenting a graphical user interface and receiving an operation instruction generated by a user acting on the graphical user interface. When a user operates the graphical user interface through the touch display screen, the graphical user interface can control local content of the electronic equipment by responding to a received operation instruction, and can also control content of a server side by responding to the received operation instruction. For example, the user-generated operational instructions acting on the graphical user interface include instructions for processing the initial audio signal, and the processor is configured to launch a corresponding application upon receiving the user-provided instructions. Further, the processor is configured to render and draw a graphical user interface associated with the application on the touch-sensitive display screen. A touch display screen is a multi-touch-sensitive screen capable of sensing touch or slide operations performed simultaneously by a plurality of points on the screen. The user performs touch operation on the graphical user interface by using a finger, and when the graphical user interface detects the touch operation, the graphical user interface controls the graphical user interface of the application to display the corresponding operation.
The technical schemes of the application will each be described in detail through specific examples. The order in which the embodiments are described below is not intended to limit their order of preference.
Referring to fig. 2, fig. 2 is a flow chart of a text similarity matching method according to an embodiment of the application. The specific flow of the text similarity matching method can be as follows:
101. Acquiring a first text and a second text to be matched.
The first text and the second text may be any form of information carrier, such as an article, a sentence or a paragraph. In addition, the first text and the second text may be in any language, such as Chinese, English or French.
In some embodiments, after the first text and the second text are acquired, irrelevant-factor elimination may be performed on the first text and the second text, respectively. For example, irrelevant factors such as irrelevant characters, stop words and punctuation marks can be removed from the first text and the second text, and the texts can be converted into a uniform format and encoding. This eliminates the influence of irrelevant factors on similarity matching and improves its accuracy.
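The cleanup described above could look roughly like the following. This is only a minimal sketch: the function name `clean_text`, the `STOP_WORDS` set and the choice of NFKC normalization plus lowercasing as the "uniform format" are all assumptions made for illustration, since the source does not specify them.

```python
import unicodedata

# Hypothetical stop-word list for illustration; the source does not give one.
STOP_WORDS = {"the", "a", "an", "of"}


def clean_text(text: str) -> str:
    """Normalize a text, drop punctuation marks, and remove stop words."""
    # Unify the representation: NFKC folds full-width and other
    # compatibility forms into their canonical equivalents.
    normalized = unicodedata.normalize("NFKC", text).lower()
    # Replace every Unicode punctuation character with a space.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in normalized
    )
    # Drop stop words; split() also collapses repeated whitespace.
    kept = [word for word in no_punct.split() if word not in STOP_WORDS]
    return " ".join(kept)


print(clean_text("The  growth, of the Business!"))
```

A real deployment would likely use a language-specific stop-word list and may need different normalization rules for the target language.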
102. Preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock.
Specifically, word segmentation may first be performed on the first text and the second text to form a first vocabulary and a second vocabulary; a word stock is then generated based on the first vocabulary and the second vocabulary.
In some embodiments, the word stock may be generated by merging and de-duplicating the first vocabulary and the second vocabulary.
For example, the first text A is: "help customers realize business growth and digital transformation"; the second text B is: "help customers realize business transformation". After word segmentation of the first text A, a first vocabulary A_T is obtained: help, customer, realize, business, growth, and, digital, transformation. After word segmentation of the second text B, a second vocabulary B_T is obtained: help, customer, realize, business, transformation. The first vocabulary A_T and the second vocabulary B_T are then merged and de-duplicated to obtain a word stock C_G: help, customer, realize, business, growth, and, digital, transformation.
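The merge-and-de-duplicate step can be sketched as follows. The function name `build_word_stock` is hypothetical, and whitespace splitting of English glosses stands in for a real word segmenter, which the source does not name.

```python
def build_word_stock(first_vocab, second_vocab):
    """Merge two vocabularies and drop duplicates, keeping first-seen order."""
    # dict.fromkeys preserves insertion order (Python 3.7+), so the word
    # stock keeps a stable order for the later vectorization step.
    return list(dict.fromkeys(first_vocab + second_vocab))


# Assumption: str.split() replaces the unspecified word segmentation step.
a_t = "help customer realize business growth and digital transformation".split()
b_t = "help customer realize business transformation".split()
c_g = build_word_stock(a_t, b_t)
print(c_g)
```

Keeping first-seen order is a design choice for this sketch; any fixed order works as long as both word frequency vectors are built against the same one.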
103. Vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector.
In some embodiments, the first vocabulary may be traversed to determine the word frequency with which each word in the first vocabulary appears in the word stock, forming a first word frequency vector; the second vocabulary is traversed to determine the word frequency with which each word in the second vocabulary appears in the word stock, forming a second word frequency vector.
Specifically, the first vocabulary A_T and the second vocabulary B_T may be traversed respectively to count the number of occurrences of their words against the word stock C_G. Each count is recorded at the position corresponding to that word in the word stock, and a word that does not occur is recorded as 0, so as to form a first word frequency vector C_A and a second word frequency vector C_B.
For example, every word in the word stock C_G (help, customer, realize, business, growth, and, digital, transformation) may be initialized to 0 to obtain the word stock vector [0,0,0,0,0,0,0,0]. Then the first vocabulary A_T is traversed, the word frequency of each word of A_T in the word stock C_G is counted, and each word frequency is written to the corresponding position of the word stock vector [0,0,0,0,0,0,0,0] to obtain a first word frequency vector C_A: [1,1,1,1,1,1,1,1]. Next, the second vocabulary B_T is traversed, the word frequency of each word of B_T in the word stock C_G is counted, and each word frequency is written to the corresponding position of the word stock vector [0,0,0,0,0,0,0,0] to obtain a second word frequency vector C_B: [1,1,1,1,0,0,0,1].
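One way to sketch this vectorization step is shown below. The function name `word_frequency_vector` is hypothetical, and the use of `collections.Counter` is an implementation choice of this sketch, not something the source prescribes.

```python
from collections import Counter


def word_frequency_vector(vocab, word_stock):
    """Return the count of each word-stock entry within the vocabulary."""
    counts = Counter(vocab)
    # Counter returns 0 for words that never occur, which matches the
    # "initialize to 0, then fill in" description.
    return [counts[word] for word in word_stock]


# Example data written as English glosses (assumption for illustration).
c_g = ["help", "customer", "realize", "business",
       "growth", "and", "digital", "transformation"]
a_t = ["help", "customer", "realize", "business",
       "growth", "and", "digital", "transformation"]
b_t = ["help", "customer", "realize", "business", "transformation"]

c_a = word_frequency_vector(a_t, c_g)
c_b = word_frequency_vector(b_t, c_g)
print(c_a)
print(c_b)
```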
104. Performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum.
Specifically, a difference vector between the first word frequency vector and the second word frequency vector can be obtained; respectively calculating word frequency sums of the first word frequency vector and the second word frequency vector to obtain a first word frequency sum and a second word frequency sum; and comparing the first word frequency sum with the second word frequency sum to determine a target word frequency sum.
In some embodiments, the first word frequency sum may be compared to the second word frequency sum to determine a size of the first word frequency sum and the second word frequency sum. When the first word frequency sum is larger than the second word frequency sum, taking the first word frequency sum as a target word frequency sum; and when the first word frequency sum is smaller than the second word frequency sum, taking the second word frequency sum as a target word frequency sum.
The word frequency sum is calculated by adding the word frequencies of the words in the corresponding word frequency vector. For example, the first word frequency sum of the first word frequency vector C_A: [1,1,1,1,1,1,1,1] is 1+1+1+1+1+1+1+1=8. The second word frequency sum of the second word frequency vector C_B: [1,1,1,1,0,0,0,1] is 1+1+1+1+0+0+0+1=5. At this time, the first word frequency sum is greater than the second word frequency sum, so the target word frequency sum C_max is the first word frequency sum, 8.
It will be appreciated that when the first word frequency sum is equal to the second word frequency sum, the first word frequency sum or the second word frequency sum may be directly taken as the target word frequency sum.
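The comparison rule, including the tie case, amounts to taking the larger of the two sums. A minimal sketch, assuming the hypothetical function name `target_word_frequency_sum`:

```python
def target_word_frequency_sum(first_vector, second_vector):
    """Return the larger of the two word frequency sums."""
    first_sum = sum(first_vector)
    second_sum = sum(second_vector)
    # max() also covers the tie case, where either sum may be used.
    return max(first_sum, second_sum)


c_max = target_word_frequency_sum(
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 1],
)
print(c_max)
```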
In some embodiments, the obtaining the difference vector between the first word frequency vector and the second word frequency vector may specifically be obtaining a difference between the first word frequency vector and the second word frequency vector, and performing absolute value processing, so as to obtain the difference vector.
For example, taking the difference between the first word frequency vector C_A: [1,1,1,1,1,1,1,1] and the second word frequency vector C_B: [1,1,1,1,0,0,0,1] and applying absolute value processing yields the difference vector C_A-B: [0,0,0,0,1,1,1,0].
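The element-wise absolute difference can be sketched as below. The function name `difference_vector` and the length check are additions of this sketch; the source only describes subtraction followed by absolute value processing.

```python
def difference_vector(first_vector, second_vector):
    """Return the element-wise absolute difference of two equal-length vectors."""
    if len(first_vector) != len(second_vector):
        # Both vectors are built against the same word stock, so their
        # lengths should always match; this guard is an added safety check.
        raise ValueError("word frequency vectors must have the same length")
    return [abs(a - b) for a, b in zip(first_vector, second_vector)]


c_a_b = difference_vector(
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 1],
)
print(c_a_b)
```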
105. Calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum.
Specifically, a quotient of the difference vector and the target word frequency sum can be calculated; and taking the quotient of the difference vector and the target word frequency sum as the text similarity of the first text and the second text.
The specific process of calculating the quotient of the difference vector and the target word frequency sum is to sum the elements in the difference vector to obtain the element sum of the difference vector, and then divide the element sum and the target word frequency sum to obtain the text similarity of the first text and the second text.
For example, the target word frequency sum C_max is 8 and the difference vector C_A-B is [0,0,0,0,1,1,1,0]. The elements of the difference vector C_A-B are summed, i.e., 0+0+0+0+1+1+1+0=3. Dividing 3 by 8 then gives a similarity of 0.375 between the first text and the second text.
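The final division can be sketched as follows. The function name `similarity_from_difference` is hypothetical, and rejecting a non-positive target sum is an assumption of this sketch, since the source does not say how two empty texts should be handled.

```python
def similarity_from_difference(diff_vector, target_sum):
    """Divide the element sum of the difference vector by the target sum."""
    if target_sum <= 0:
        # Assumption: the source does not define the result for empty texts.
        raise ValueError("target word frequency sum must be positive")
    return sum(diff_vector) / target_sum


score = similarity_from_difference([0, 0, 0, 0, 1, 1, 1, 0], 8)
print(score)
```

As computed here, a difference vector of all zeros yields 0, so under this definition a smaller value corresponds to fewer differing word frequencies.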
In summary, the text similarity matching method provided by the embodiment of the application includes obtaining a first text and a second text to be matched; preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock; vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector; performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum; and calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum. The scheme segments the text into words during calculation, so that comparison is performed at word level; compared with existing character-level comparison, the calculation efficiency is greatly improved. In other words, word segmentation shrinks the comparison space by replacing character-by-character comparison with word-by-word comparison. In addition, the scheme vectorizes the first vocabulary and the second vocabulary, so that the similarity of the first text and the second text can be calculated rapidly and in parallel through operations between vectors. The scheme can therefore greatly improve the efficiency of text similarity matching.
In order to facilitate better implementation of the text similarity matching method provided by the embodiment of the application, the embodiment of the application also provides a text similarity matching device. The meaning of the nouns is the same as that in the text similarity matching method, and specific implementation details can be referred to the description in the embodiment of the method.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text similarity matching device according to an embodiment of the present application. The text similarity matching apparatus may include a text acquisition unit 201, a text processing unit 202, a vocabulary processing unit 203, a vector processing unit 204, and a text matching unit 205. Wherein,
A text obtaining unit 201, configured to obtain a first text and a second text to be matched;
A text processing unit 202, configured to pre-process the first text and the second text to form a first vocabulary, a second vocabulary and a word stock;
A vocabulary processing unit 203, configured to perform vectorization processing on the first vocabulary and the second vocabulary through the word stock, and generate a first word frequency vector and a second word frequency vector;
A vector processing unit 204, configured to perform vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector, and generate a difference vector and a target word frequency sum;
the text matching unit 205 is configured to calculate a text similarity between the first text and the second text using the difference vector and the target word frequency sum.
The specific embodiments of the above units may be referred to the above embodiments of the text similarity matching method, and will not be described herein in detail.
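To show how the five units fit together, the following sketch maps each unit to one method of a single class. The class name `TextSimilarityMatcher`, all method names, the whitespace-based segmentation and the zero result for two empty texts are assumptions made for illustration, not the structure of the described device.

```python
from collections import Counter


class TextSimilarityMatcher:
    """Hypothetical class; each method mirrors one unit (201 to 205)."""

    def acquire(self, first_text, second_text):
        # Unit 201: obtain the two texts to be matched.
        return first_text, second_text

    def preprocess(self, first_text, second_text):
        # Unit 202: segment both texts (whitespace split is an assumption)
        # and build the merged, de-duplicated word stock.
        first_vocab = first_text.split()
        second_vocab = second_text.split()
        word_stock = list(dict.fromkeys(first_vocab + second_vocab))
        return first_vocab, second_vocab, word_stock

    def vectorize(self, first_vocab, second_vocab, word_stock):
        # Unit 203: count each word-stock entry in each vocabulary.
        first_counts, second_counts = Counter(first_vocab), Counter(second_vocab)
        first_vector = [first_counts[w] for w in word_stock]
        second_vector = [second_counts[w] for w in word_stock]
        return first_vector, second_vector

    def process_vectors(self, first_vector, second_vector):
        # Unit 204: absolute difference vector and target word frequency sum.
        diff = [abs(a - b) for a, b in zip(first_vector, second_vector)]
        target_sum = max(sum(first_vector), sum(second_vector))
        return diff, target_sum

    def match(self, diff, target_sum):
        # Unit 205: quotient of the difference sum and the target sum.
        # Assumption: two empty texts yield 0.0 instead of dividing by zero.
        return sum(diff) / target_sum if target_sum else 0.0

    def run(self, first_text, second_text):
        texts = self.acquire(first_text, second_text)
        first_vocab, second_vocab, word_stock = self.preprocess(*texts)
        vectors = self.vectorize(first_vocab, second_vocab, word_stock)
        diff, target_sum = self.process_vectors(*vectors)
        return self.match(diff, target_sum)


matcher = TextSimilarityMatcher()
result = matcher.run(
    "help customer realize business growth and digital transformation",
    "help customer realize business transformation",
)
print(result)
```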
In summary, the text similarity matching device provided by the embodiment of the present application may obtain, through the text obtaining unit 201, a first text and a second text to be matched; the text processing unit 202 preprocesses the first text and the second text to form a first vocabulary, a second vocabulary and a word stock; the vocabulary processing unit 203 performs vectorization processing on the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector; the vector processing unit 204 performs vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum; and the text matching unit 205 calculates the text similarity of the first text and the second text using the difference vector and the target word frequency sum. The scheme segments the text into words during calculation, so that comparison is performed at word level; compared with existing character-level comparison, the calculation efficiency is greatly improved. In other words, word segmentation shrinks the comparison space by replacing character-by-character comparison with word-by-word comparison. In addition, the scheme vectorizes the first vocabulary and the second vocabulary, so that the similarity of the first text and the second text can be calculated rapidly and in parallel through operations between vectors. The scheme can therefore greatly improve the efficiency of text similarity matching.
The embodiment of the application also provides an electronic device, in which the text similarity matching device of the embodiment of the application can be integrated, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the application, specifically:
The electronic device may include a Radio Frequency (RF) circuit 601, a memory 602 including one or more computer readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a wireless fidelity (Wireless Fidelity, WiFi) module 607, a processor 608 including one or more processing cores, and a power supply 609. Those skilled in the art will appreciate that the electronic device structure shown in fig. 4 does not limit the electronic device, which may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components.
Wherein:
The RF circuit 601 may be used for receiving and transmitting signals while sending and receiving messages or during a call. In particular, after downlink information of a base station is received, it is handed to one or more processors 608 for processing; in addition, uplink data is transmitted to the base station. Typically, the RF circuit 601 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 601 may also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including, but not limited to, Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 602 may be used to store software programs and modules, and the processor 608 may execute various functional applications and information processing by executing the software programs and modules stored in the memory 602. The memory 602 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data created according to the use of the electronic device (such as audio data, phonebooks, etc.), and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide access to the memory 602 by the processor 608 and the input unit 603.
The input unit 603 may be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, the input unit 603 may include a touch-sensitive surface, as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations thereon or thereabout by a user (e.g., operations thereon or thereabout by a user using any suitable object or accessory such as a finger, stylus, etc.), and actuate the corresponding connection means according to a predetermined program. Alternatively, the touch-sensitive surface may comprise two parts, a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 608, and can receive commands from the processor 608 and execute them. In addition, touch sensitive surfaces may be implemented in a variety of types, such as resistive, capacitive, infrared, and surface acoustic waves. The input unit 603 may comprise other input devices in addition to a touch sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 604 may be used to display information entered by a user or provided to a user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. The display unit 604 may include a display panel, which may optionally be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. Further, the touch-sensitive surface may overlay the display panel; upon detecting a touch operation on or near it, the touch-sensitive surface passes the operation to the processor 608 to determine the type of touch event, and the processor 608 then provides a corresponding visual output on the display panel based on the type of touch event. Although in fig. 4 the touch-sensitive surface and the display panel are implemented as two separate components for input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement the input and output functions.
The electronic device may also include at least one sensor 605, such as a light sensor, a motion sensor, and other sensors. In particular, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or backlight when the electronic device is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and the direction when the mobile phone is stationary, and can be used for applications of recognizing the gesture of the mobile phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the electronic device are not described in detail herein.
The audio circuit 606, a speaker, and a microphone may provide an audio interface between the user and the electronic device. The audio circuit 606 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; conversely, the microphone converts collected sound signals into electrical signals, which the audio circuit 606 receives and converts into audio data; the audio data is then output to the processor 608 for processing and sent via the RF circuit 601 to, for example, another electronic device, or output to the memory 602 for further processing. The audio circuit 606 may also include an earphone jack to allow a peripheral earphone to communicate with the electronic device.
WiFi belongs to a short-distance wireless transmission technology, and the electronic equipment can help a user to send and receive emails, browse webpages, access streaming media and the like through the WiFi module 607, so that wireless broadband Internet access is provided for the user. Although fig. 4 shows a WiFi module 607, it is understood that it does not belong to the necessary constitution of the electronic device, and can be omitted entirely as needed within the scope of not changing the essence of the invention.
The processor 608 is a control center of the electronic device that uses various interfaces and lines to connect the various parts of the overall handset, performing various functions of the electronic device and processing the data by running or executing software programs and/or modules stored in the memory 602, and invoking data stored in the memory 602, thereby performing overall monitoring of the handset. Optionally, the processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 608.
The electronic device also includes a power supply 609 (e.g., a battery) for powering the various components, which may be logically connected to the processor 608 via a power management system so as to perform functions such as managing charge, discharge, and power consumption via the power management system. The power supply 609 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown, the electronic device may further include a camera, a bluetooth module, etc., which will not be described herein. In particular, in this embodiment, the processor 608 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 608 executes the application programs stored in the memory 602, so as to implement various functions, for example:
acquiring a first text and a second text to be matched;
preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock;
vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector;
performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum;
and calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum.
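The five steps above can be sketched end to end in Python. This is only an illustrative sketch under stated assumptions, not the applicant's implementation: the names `tokenize` and `match` are hypothetical, whitespace splitting stands in for a real word segmenter, and the difference vector is assumed to be collapsed to a scalar by summing the absolute values of its components before dividing by the target word frequency sum.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for word segmentation: split on whitespace.
    return text.lower().split()


def match(first_text, second_text):
    # Preprocess: two word lists plus a merged, de-duplicated word stock.
    first_words = tokenize(first_text)
    second_words = tokenize(second_text)
    word_stock = sorted(set(first_words) | set(second_words))

    # Vectorize: one word frequency per word stock entry.
    first_counts = Counter(first_words)
    second_counts = Counter(second_words)
    first_vec = [first_counts[w] for w in word_stock]
    second_vec = [second_counts[w] for w in word_stock]

    # Difference vector and target (larger) word frequency sum.
    diff_vec = [a - b for a, b in zip(first_vec, second_vec)]
    target_sum = max(sum(first_vec), sum(second_vec))

    # Assumption: reduce the difference vector with an L1 norm, then
    # take the quotient with the target word frequency sum.
    if target_sum == 0:
        return 0.0
    return sum(abs(d) for d in diff_vec) / target_sum
```

Under these assumptions a result of 0 means the two word frequency vectors are identical, and larger values mean the texts share fewer words. Whether the scheme applies a further transformation to this quotient is not specified here.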
In summary, the electronic device provided by the embodiment of the application can obtain the first text and the second text to be matched; preprocess the first text and the second text to form a first vocabulary, a second vocabulary and a word stock; vectorize the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector; perform vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum; and calculate the text similarity of the first text and the second text using the difference vector and the target word frequency sum. Because the scheme segments the texts into words before calculation, the comparison is performed at word level, which greatly improves calculation efficiency compared with the existing character-level comparison. In addition, the scheme vectorizes the first vocabulary and the second vocabulary, so that the similarity of the first text and the second text can be calculated quickly and in parallel. That is, the scheme shrinks the comparison space through word segmentation, replacing character-by-character comparison with word-by-word comparison, and further improves calculation efficiency by operating on vectors. The text similarity matching efficiency can therefore be greatly improved by the scheme.
In the foregoing embodiments, the description of each embodiment has its own emphasis; for portions not described in detail in a given embodiment, refer to the above detailed description of the text similarity matching method, which is not repeated here.
It should be noted that, for the text similarity matching method in the embodiment of the present application, it will be understood by those skilled in the art that all or part of the process of implementing the text similarity matching method in the embodiment of the present application may be implemented by controlling related hardware by a computer program, where the computer program may be stored in a computer readable storage medium, such as a memory of a terminal, and executed by at least one processor in the terminal, and the execution may include the process of implementing the embodiment of the text similarity matching method.
For the text similarity matching device of the embodiment of the application, the functional modules may be integrated in one processing chip, each module may exist alone physically, or two or more modules may be integrated in one module. The integrated modules may be implemented in hardware or as software functional modules. If implemented as software functional modules and sold or used as a stand-alone product, the integrated modules may also be stored in a computer readable storage medium.
To this end, an embodiment of the present application provides a storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the text similarity matching methods provided by the embodiments of the present application. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
The text similarity matching method, device, storage medium and electronic equipment provided by the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is only intended to help understand the core idea of the application. Meanwhile, those skilled in the art may vary the specific embodiments and the scope of application according to the ideas of the present application. In summary, the content of this description should not be construed as limiting the present application.

Claims (10)

1. A text similarity matching method, comprising:
Acquiring a first text and a second text to be matched;
Preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock;
vectorizing the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector;
Performing vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector to generate a difference vector and a target word frequency sum;
and calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum.
2. The text similarity matching method of claim 1, wherein said performing vector difference processing and word frequency summation processing based on said first word frequency vector and said second word frequency vector to generate a difference vector and a target word frequency sum comprises:
Obtaining a difference vector of the first word frequency vector and the second word frequency vector;
Respectively calculating word frequency sums of the first word frequency vector and the second word frequency vector to obtain a first word frequency sum and a second word frequency sum;
and comparing the first word frequency sum with the second word frequency sum to determine a target word frequency sum.
3. The text similarity matching method of claim 2, wherein said comparing said first word frequency sum with said second word frequency sum to determine a target word frequency sum comprises:
Comparing the first word frequency sum with the second word frequency sum;
when the first word frequency sum is larger than the second word frequency sum, taking the first word frequency sum as a target word frequency sum;
And when the first word frequency sum is smaller than the second word frequency sum, taking the second word frequency sum as a target word frequency sum.
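The difference and summation steps of claims 2 and 3 can be illustrated with a short Python sketch. The function name `difference_and_target_sum` is hypothetical, and element-wise subtraction is assumed as the meaning of "difference vector". The claims only cover the cases where one sum is strictly larger than the other; using `max` here is an assumption that simply returns the common value when the two sums are equal.

```python
def difference_and_target_sum(first_vec, second_vec):
    # Both vectors are assumed to be indexed by the same word stock.
    if len(first_vec) != len(second_vec):
        raise ValueError("word frequency vectors must have the same length")

    # Element-wise difference of the two word frequency vectors.
    diff_vec = [a - b for a, b in zip(first_vec, second_vec)]

    # Word frequency sum of each vector; the larger one is the target.
    first_sum = sum(first_vec)
    second_sum = sum(second_vec)
    target_sum = max(first_sum, second_sum)
    return diff_vec, target_sum
```

For example, the vectors `[2, 0, 1]` and `[1, 1, 0]` have sums 3 and 2, so the target word frequency sum is 3.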
4. The text similarity matching method of claim 1, wherein said vectorizing said first vocabulary and said second vocabulary through said word stock to generate a first word frequency vector and a second word frequency vector comprises:
traversing the first vocabulary to determine the word frequency of each word in the first vocabulary in the word stock to form a first word frequency vector;
traversing the second vocabulary to determine the word frequency of each word in the second vocabulary in the word stock to form a second word frequency vector.
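One possible reading of the vectorization in claim 4 is sketched below in Python. It assumes that each word frequency vector has one component per word stock entry, holding the number of times that entry occurs in the vocabulary (word list) of the text. The function name `word_frequency_vector` is hypothetical.

```python
from collections import Counter


def word_frequency_vector(vocabulary, word_stock):
    # Count every word of the segmented text once, in a single pass.
    counts = Counter(vocabulary)
    # Emit the counts in word stock order; absent words count as 0.
    return [counts[word] for word in word_stock]
```

Because both texts are vectorized against the same word stock, the two resulting vectors have equal length and can be subtracted component by component.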
5. The text similarity matching method of claim 1, wherein the preprocessing the first text and the second text to form a first vocabulary, a second vocabulary, and a word stock comprises:
Word segmentation is carried out on the first text and the second text respectively, so that a first vocabulary and a second vocabulary are formed;
Generating a word stock based on the first vocabulary and the second vocabulary.
6. The text similarity matching method of claim 5, wherein said generating a word stock based on said first vocabulary and said second vocabulary comprises:
and merging and de-duplication processing is carried out on the first vocabulary and the second vocabulary, so as to generate a word stock.
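The merging and de-duplication of claim 6 can be sketched in a few lines of Python. The function name `build_word_stock` is hypothetical, and keeping the words in first-seen order is a design choice made for this sketch only; the claim does not prescribe any ordering.

```python
def build_word_stock(first_vocabulary, second_vocabulary):
    # Merge the two word lists, then drop repeated words.
    merged = list(first_vocabulary) + list(second_vocabulary)
    # dict.fromkeys keeps the first occurrence of each word, in order.
    return list(dict.fromkeys(merged))
```

A stable order matters only in that both word frequency vectors must be built against the same ordering of the word stock.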
7. The text similarity matching method of claim 1, wherein said calculating the text similarity of the first text and the second text using the difference vector and the target word frequency sum comprises:
Calculating the quotient of the difference vector and the target word frequency sum;
And taking the quotient of the difference vector and the target word frequency sum as the text similarity of the first text and the second text.
8. A text similarity matching apparatus, comprising:
the text acquisition unit is used for acquiring a first text and a second text to be matched;
the text processing unit is used for preprocessing the first text and the second text to form a first vocabulary, a second vocabulary and a word stock;
the vocabulary processing unit is used for carrying out vectorization processing on the first vocabulary and the second vocabulary through the word stock to generate a first word frequency vector and a second word frequency vector;
the vector processing unit is used for carrying out vector difference processing and word frequency summation processing based on the first word frequency vector and the second word frequency vector, and generating a difference vector and a target word frequency sum;
and the text matching unit is used for calculating the text similarity of the first text and the second text by using the difference vector and the target word frequency sum.
9. A storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor to perform the text similarity matching method of any of claims 1-7.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the text similarity matching method of any of claims 1-7 when the computer program is executed by the processor.
CN202410242368.2A 2024-03-04 2024-03-04 Text similarity matching method and device, storage medium and electronic equipment Pending CN118035765A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410242368.2A CN118035765A (en) 2024-03-04 2024-03-04 Text similarity matching method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410242368.2A CN118035765A (en) 2024-03-04 2024-03-04 Text similarity matching method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN118035765A true CN118035765A (en) 2024-05-14

Family

ID=90994793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410242368.2A Pending CN118035765A (en) 2024-03-04 2024-03-04 Text similarity matching method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN118035765A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination