CN113515954A - Word string relevance calculation method and system and electronic equipment - Google Patents

Publication number
CN113515954A
Authority
CN
China
Prior art keywords
vocabulary
processed
string
word
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110917193.7A
Other languages
Chinese (zh)
Inventor
陈海峰
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongaotao Data Technology Co ltd
Original Assignee
Beijing Zhongaotao Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongaotao Data Technology Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/44 Statistical methods, e.g. probability models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve, transforms
    • G06F17/141 Discrete Fourier transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention provides a method, a system and electronic equipment for calculating word string relevance, and relates to the field of word string relevance calculation. The method comprises: first, acquiring a word string to be processed from a target text; then, counting the types and the number of the words in the word string to be processed within a preset time period; determining a vocabulary statistical curve for each type of word according to the types and the number of the words in the word string to be processed; and finally, acquiring the feature vectors of the words from the vocabulary statistical curves, and determining the relevance calculation result of the word string to be processed according to the feature vectors of the words. By screening the target words in the target text and counting how often the target words occur at the same time points, the method obtains feature vectors that share the same time points and the same dimensionality, and determines the relevance calculation result of the word string to be processed from those feature vectors. This improves the association with other words and improves the calculation accuracy of relevance results for words that are related across time and space.

Description

Word string relevance calculation method and system and electronic equipment
Technical Field
The present invention relates to the field of word string relevancy calculation technologies, and in particular, to a method, a system, and an electronic device for calculating word string relevancy.
Background
Vocabulary relevance calculation in text files is widely applied in fields such as text clustering and information retrieval. In the prior art, however, the calculation only considers the relevance between a pair of words and ignores their association with other words, which results in large errors in the relevance calculation results for words that are related across time and space.
Disclosure of Invention
In view of this, the present invention provides a method, a system, and an electronic device for calculating word string relevance. The method screens target words in a target text and counts how often the target words occur at the same time points, so as to obtain feature vectors that share the same time points and the same dimensionality, and determines the relevance calculation result of the word string to be processed according to the feature vectors of the words. This improves the association with other words and improves the calculation accuracy of relevance results for words that are related across time and space.
In a first aspect, an embodiment of the present invention provides a method for calculating a string relevance, where the method includes:
acquiring a word string to be processed from a target text;
counting the types and the number of the words in the word string to be processed within a preset time period;
determining a vocabulary statistical curve for each type of word according to the types and the number of the words in the word string to be processed; wherein the abscissa of the vocabulary statistical curve is time, and the ordinate of the vocabulary statistical curve is the number of the words in the word string to be processed;
and acquiring the feature vectors of the words in the vocabulary statistical curve, and determining the relevance calculation result of the word string to be processed according to the feature vectors of the words.
In some embodiments, the step of obtaining the word string to be processed from the target text includes:
acquiring a target text; the target text is obtained from a text database, a web crawler tool result and a log file;
reading a target text and traversing all word strings in the target text to obtain word strings corresponding to all words in the target text;
and determining the word strings to be processed according to the word strings corresponding to all the words in the target text.
In some embodiments, the step of reading the target text and traversing all word strings in the target text to obtain word strings corresponding to all vocabularies in the target text includes:
acquiring a shielding vocabulary list; the shielding vocabulary list is used for shielding vocabularies of the target text;
traversing all word strings in the target text, shielding the word strings belonging to the shielding vocabulary list, and determining the word strings corresponding to all the vocabulary in the target text.
In some embodiments, the step of determining a vocabulary statistical curve under the category of the vocabulary according to the category and the number of the vocabulary in the word string to be processed includes:
acquiring the type and the number of vocabularies in the character string to be processed within a preset time period;
and counting a vocabulary statistical curve in a preset time period according to the types of the vocabularies.
In some embodiments, obtaining feature vectors for words in the lexical statistics curve includes:
smoothing the vocabulary statistical curve to obtain a continuous smooth curve of the vocabulary;
performing Fourier transformation on the continuous smooth curve of the vocabulary to obtain a transformation result of the continuous smooth curve; wherein, the transformation result is in a complex form;
and calculating the amplitude of the transformation result, and determining the feature vector of the vocabulary according to the amplitude of the transformation result.
In some embodiments, determining the relevancy calculation result of the word string to be processed according to the feature vector of the vocabulary includes:
acquiring a feature vector of a vocabulary;
clustering the characteristic vectors of the vocabularies to determine clustering results of the vocabularies;
and determining the correlation calculation result of the word string to be processed according to the clustering result of the words.
In some embodiments, the step of counting the category and number of words in the word string to be processed within a preset time period includes:
acquiring character strings to be processed in a preset time period;
inputting the word string to be processed into a preset word segmentation model to obtain all words contained in the word string to be processed;
and counting all the words in the word strings to be processed, and determining the types and the number of the words in the word strings to be processed.
In a second aspect, an embodiment of the present invention provides a system for calculating string relevancy, where the system includes:
the word string to be processed acquiring module is used for acquiring a word string to be processed from the target text;
the word string counting module is used for counting the types and the number of words in the word string to be processed in a preset time period;
the vocabulary counting module is used for determining a vocabulary counting curve under the category of the vocabulary according to the category and the number of the vocabulary in the character string to be processed; wherein, the abscissa of the vocabulary statistical curve is time; the vertical coordinate of the vocabulary statistical curve is the number of vocabularies in the character string to be processed;
and the relevance calculating module is used for acquiring the characteristic vector of the vocabulary in the vocabulary statistical curve and determining the relevance calculating result of the word string to be processed according to the characteristic vector of the vocabulary.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory; the memory stores a computer program which, when executed by the processor, implements the steps of the method for calculating word string relevance described in any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for calculating word string relevance described in any possible implementation of the first aspect.
The embodiment of the invention has the following beneficial effects:
The invention provides a method, a system and electronic equipment for calculating word string relevance. The method comprises: first, acquiring a word string to be processed from a target text; then, counting the types and the number of the words in the word string to be processed within a preset time period; determining a vocabulary statistical curve for each type of word according to the types and the number of the words in the word string to be processed, wherein the abscissa of the vocabulary statistical curve is time and the ordinate is the number of the words in the word string to be processed; and finally, acquiring the feature vectors of the words from the vocabulary statistical curves, and determining the relevance calculation result of the word string to be processed according to the feature vectors of the words. By screening the target words in the target text and counting how often the target words occur at the same time points, the method obtains feature vectors that share the same time points and the same dimensionality, and determines the relevance calculation result of the word string to be processed from those feature vectors. This improves the association with other words and improves the calculation accuracy of relevance results for words that are related across time and space.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the invention as set forth above.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart illustrating a method for calculating string relevancy according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a step S101 of a method for calculating string relevancy according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S202 of a method for calculating string relevancy according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating step S103 of a method for calculating string relevancy according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for calculating word string relevancy according to an embodiment of the present invention, wherein feature vectors of words in a word statistical curve are obtained;
FIG. 6 is a flowchart illustrating a method for calculating string relevancy according to an embodiment of the present invention, wherein the method determines relevancy calculation results of a string to be processed according to vocabulary feature vectors;
FIG. 7 is a flowchart illustrating the step S102 of a method for calculating string relevancy according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating another string relevancy calculation method according to an embodiment of the present invention;
FIG. 9 is a block diagram illustrating a computing system for string relevancy according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Reference numerals:
910 - word string to be processed acquisition module; 920 - word string statistics module; 930 - vocabulary statistics module; 940 - relevance calculation module; 101 - processor; 102 - memory; 103 - bus; 104 - communication interface.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Research on vocabulary relevance is a basic research subject in natural language processing, and improving the quality of relevance calculation is significant in many application fields such as text clustering, semantic disambiguation, the Semantic Web and information retrieval. Vocabulary relevance calculation in text files has been widely applied in fields such as text clustering and information retrieval. However, most prior-art approaches focus on the relevance between a pair of words, and most rest on one assumption: that related words should at least "co-occur". Such approaches work poorly for words that are related across time and space, and the relevance calculation results for those words have large errors.
Based on this, the embodiment of the present invention provides a method, a system, and an electronic device for calculating word string relevance. The method screens target words in a target text and counts how often the target words occur at the same time points, so as to obtain feature vectors that share the same time points and the same dimensionality, and determines the relevance calculation result of the word string to be processed according to the feature vectors of the words. This improves the association with other words and improves the calculation accuracy of relevance results for words that are related across time and space.
To facilitate understanding of the present embodiment, first, a detailed description is given of a method for calculating string relevancy according to the present embodiment.
Referring to fig. 1, a flowchart of a method for calculating string relevancy is shown, where the method includes the following steps:
Step S101, obtaining a word string to be processed from the target text.
The calculation of word string relevance needs to be based on a certain amount of target text, where the target text includes a large amount of text data from different times. The relevance between words differs across types of text data, so the target text can be a text data set of a single fixed type or text data sets of different types, according to actual requirements. The word string relevance results differ with the text data samples used, but the larger the number of target texts, the higher the accuracy of the determined results.
The word string to be processed may be composed of letters, Chinese characters, numbers and punctuation marks; generally, it consists mainly of Chinese characters.
Step S102, counting the types and the number of the words in the word string to be processed within a preset time period.
This step counts the frequency of the words in the word string to be processed within a preset time period. After the target text is determined, the words in the target text need to be separated by preset time period, that is, a set of the words appearing at each time is obtained. The preset time period may be set to one hour, one minute, or the like.
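The counting in step S102 can be illustrated with a minimal Python sketch. It assumes the words have already been segmented and that each record carries a timestamp; the function name `count_words_per_period`, the `(timestamp, words)` record shape and the hourly bucket format are illustrative assumptions, not part of the described method.

```python
from collections import Counter, defaultdict
from datetime import datetime


def count_words_per_period(records, period_format="%Y-%m-%d %H"):
    """Count each word type per time bucket.

    records: iterable of (timestamp, list_of_words) pairs (assumed input shape).
    period_format: strftime pattern that defines the bucket; hourly by default.
    """
    counts = defaultdict(Counter)
    for timestamp, words in records:
        bucket = timestamp.strftime(period_format)
        counts[bucket].update(words)
    return counts


# Hypothetical sample data: three already-segmented records.
records = [
    (datetime(2021, 8, 1, 9, 5), ["rain", "flood", "rain"]),
    (datetime(2021, 8, 1, 9, 40), ["flood"]),
    (datetime(2021, 8, 1, 10, 15), ["rain"]),
]
per_hour = count_words_per_period(records)
```

Changing `period_format` to `"%Y-%m-%d %H:%M"` would give one-minute buckets, matching the other preset period mentioned above.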
Step S103, determining a vocabulary statistical curve under the category of the vocabulary according to the category and the number of the vocabulary in the character string to be processed.
The abscissa of the vocabulary statistical curve is time; the ordinate of the vocabulary statistical curve is the number of vocabularies in the character string to be processed. After the types and the number of the vocabularies in the character string to be processed are obtained, the frequency of occurrence of different vocabularies in each time can be determined. And then forming a rectangular coordinate system by taking time as an abscissa and the number of the vocabularies as an ordinate, and respectively representing different vocabularies in the target text in the rectangular coordinate system to obtain different types of vocabulary statistical curves.
Step S104, obtaining the characteristic vector of the vocabulary in the vocabulary statistical curve, and determining the correlation calculation result of the word string to be processed according to the characteristic vector of the vocabulary.
The vocabulary statistical curve captures how often a word occurs and how that frequency changes over time. In practice, the curve needs to be further quantified in order to determine the relevance of the word string to be processed; for example, this can be done by obtaining the feature vector of the word from its vocabulary statistical curve. Specifically, the feature vector represents the features of a word in the word string to be processed, and converting the vocabulary statistical curve into a feature vector allows the relevance calculation result of the word string to be processed to be calculated accurately.
In practice, each continuous vocabulary statistical curve can be processed with a Fourier transform to obtain the amplitude and the phase of each curve, which serve as the parameters of the feature vector of the corresponding word. After the feature vectors are obtained, the relevance calculation result can be obtained by clustering the feature vectors. The clustering method can be chosen according to actual requirements: the more clustering conditions are imposed, the stronger the relevance among the words in each resulting cluster; the fewer the conditions, the weaker the relevance. Words that fall into the same cluster are determined to be related words, which gives the relevance calculation result of the word string to be processed.
In the word string relevance calculation method provided in this embodiment, the target words in the target text are screened and the occurrence frequency of the target words at the same time points is counted, so that feature vectors sharing the same time points and the same dimensionality are obtained. The relevance calculation result of the word string to be processed is determined according to the feature vectors of the words, which improves the association with other words and improves the calculation accuracy of relevance results for words that are related across time and space.
In some embodiments, the step S101 of obtaining the word string to be processed from the target text, as shown in fig. 2, includes:
Step S201, acquiring a target text; the target text is obtained from a text database, a web crawler tool result and a log file.
The target text may be a text data set of a fixed type or text data sets of different types. The sources of the target text include at least a text database, a web crawler tool and a log file. For example, related text data can be retrieved from the text database to form the target text; pages can be captured by a web crawler tool to obtain the corresponding text data; and related log files can also be used directly as the target text.
Step S202, reading the target text and traversing all word strings in the target text to obtain word strings corresponding to all words in the target text.
After the target text is obtained, it needs to be read, and the reading process may use different strategies depending on the size of the text. For example, when the size of the target text file exceeds a preset threshold, the target text can be read line by line to speed up loading; if the target text contains no line break characters, it is read piece by piece according to a preset reading range. When the size of the target text file does not exceed the preset threshold, the target text can be read directly.
Reading the target text yields all the word strings in the target text. These word strings contain various words, which need to be segmented to finally obtain the word string corresponding to each word.
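The size-dependent reading strategy can be sketched as follows. This is only an illustration under stated assumptions: the function name `read_target_text`, the 10 MB threshold and the 64 KB reading range are hypothetical values, and the "preset reading range" is modeled as a maximum number of characters per read. Python's `readline(size)` returns either a whole line or, when a line is longer than `size`, a piece of at most `size` characters, so one loop covers both the line-by-line case and the case without line breaks.

```python
import os


def read_target_text(path, size_threshold=10 * 1024 * 1024,
                     chunk_size=64 * 1024, encoding="utf-8"):
    """Yield the text at `path` in pieces chosen by file size.

    Files up to size_threshold bytes are yielded whole. Larger files are
    yielded line by line; a line longer than chunk_size characters (for
    example, a file with no line breaks) is yielded in chunk_size pieces.
    """
    if os.path.getsize(path) <= size_threshold:
        with open(path, encoding=encoding) as handle:
            yield handle.read()
        return
    with open(path, encoding=encoding) as handle:
        while True:
            piece = handle.readline(chunk_size)
            if not piece:  # empty string signals end of file
                break
            yield piece
```

Because the function is a generator, a large file is never held in memory as a whole; the caller can segment each piece as it arrives.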
Step S203, determining the word string to be processed according to the word strings corresponding to all the words in the target text.
After the word strings corresponding to all the words in the target text are obtained, the final word string to be processed can be obtained by combining the results of the segmentation processing. The segmentation result may include separators attached to the segmented words, such as "/n", "/w" and "/q", where the letter after each separator corresponds to the relevance of the word string.
Because the word strings obtained from the target text usually contain useless words, those words need to be masked. Therefore, in some embodiments, the step S202 of reading the target text and traversing all the word strings in the target text to obtain the word strings corresponding to all the words in the target text, as shown in fig. 3, includes:
Step S301, acquiring a shielding vocabulary list; the shielding vocabulary list is used for shielding words in the target text.
The shielding vocabulary list serves as the basis for vocabulary shielding. It generally contains the sensitive words of various industries and can be set according to the actual usage scenario, so it is not described further here.
Step S302, traverse all word strings in the target text, mask word strings belonging to the masked vocabulary list, and determine word strings corresponding to all vocabularies in the target text.
The step may be regarded as a search process, and in a specific implementation process, all word strings in the target text may be traversed sequentially according to the vocabulary in the shielded vocabulary list, and if the vocabulary in the target text belongs to the shielded vocabulary list, the vocabulary in the target text is deleted until all word strings related to the shielded vocabulary list in all word strings in the target text are deleted.
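The shielding step can be sketched in a few lines of Python. The function name `mask_words` and the sample words are hypothetical. The text describes traversing the word strings once per entry of the shielding vocabulary list; the sketch instead makes a single pass with a set lookup, which removes the same words.

```python
def mask_words(word_strings, shielded_vocabulary):
    """Return the word strings with every shielded word removed."""
    shielded = set(shielded_vocabulary)  # set lookup avoids rescanning the list
    return [word for word in word_strings if word not in shielded]


# Hypothetical example: "the" and "of" are on the shielding vocabulary list.
words = ["market", "the", "price", "of", "price"]
kept = mask_words(words, ["the", "of"])
```

Repeated words that are not shielded are kept, so the later frequency counts are unaffected.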
In some embodiments, the step S103 of determining the vocabulary statistical curve under the category of the vocabulary according to the category and the number of the vocabulary in the word string to be processed, as shown in fig. 4, includes:
Step S401, acquiring the types and the number of the words in the word string to be processed within the preset time period.
After the target text is determined, word segmentation needs to be performed on it to obtain the words corresponding to each target text. The frequency with which different words occur at different times is then counted within the word set of each target text, which finally gives the types and the number of the words in the word string to be processed.
Step S402, generating the vocabulary statistical curve within the preset time period for each type of word.
In a specific implementation, the texts of each day can be pooled, all texts of that day segmented using a text word segmentation technique, and the types and the number of all words appearing on that day counted to determine the frequency of each word. For example, if the statistics cover 400 days, each word has one frequency value per day. With the horizontal axis representing time (days) and the vertical axis representing the frequency of the word, the 400 daily frequencies of a word are plotted in the rectangular coordinate system as 400 discrete points, and the vocabulary statistical curve within the preset time period is finally obtained from these discrete points.
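As an illustration of the daily statistics, the following sketch builds one frequency series per word, with one value per day. The function name `build_frequency_curves` and the input shape (a dictionary from date to the segmented words of that day) are assumptions made for the example; days on which a word does not occur are filled with 0 so that all series have the same length.

```python
from collections import Counter
from datetime import date, timedelta


def build_frequency_curves(daily_words, start, num_days):
    """Return {word: [count on day 0, count on day 1, ...]}.

    daily_words: dict mapping a date to the list of segmented words
                 collected on that day (assumed input shape).
    """
    curves = {}
    for offset in range(num_days):
        day = start + timedelta(days=offset)
        for word, count in Counter(daily_words.get(day, [])).items():
            # Create an all-zero series the first time a word is seen.
            curves.setdefault(word, [0] * num_days)[offset] = count
    return curves


# Hypothetical three-day example (the text uses 400 days).
start = date(2021, 1, 1)
daily_words = {
    date(2021, 1, 1): ["rain", "flood", "rain"],
    date(2021, 1, 3): ["rain"],
}
curves = build_frequency_curves(daily_words, start, num_days=3)
```

Each list in `curves` corresponds to the discrete points of one vocabulary statistical curve, with the list index as the abscissa (day) and the value as the ordinate (frequency).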
In some embodiments, obtaining the feature vectors of the vocabulary in the vocabulary statistics curve, as shown in fig. 5, comprises:
Step S501, smoothing the vocabulary statistical curve to obtain a continuous smooth curve of the word.
Because the vocabulary statistical curve is composed of discrete points, smoothing it makes the curve smoother, which helps improve the accuracy of subsequent calculations. In a specific implementation, the smoothing can be performed with a moving average.
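A minimal sketch of moving-average smoothing is given below. The text does not specify the window size or the handling of the curve ends, so both are assumptions here: the window is centered on each point and shrinks near the edges, which keeps the output the same length as the input.

```python
def moving_average(values, window=7):
    """Smooth a frequency series with a centered moving average.

    The window is truncated at both ends of the series, so the result
    has the same number of points as the input.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical five-day series smoothed with a three-day window.
smooth = moving_average([0, 3, 0, 3, 0], window=3)
```

Keeping the length unchanged means that a 400-day series still yields 400 points for the Fourier transform in the next step.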
Step S502, Fourier transform is carried out on the continuous smooth curve of the vocabulary, and a transform result of the continuous smooth curve is obtained; wherein the transformation result is in complex form.
Performing a Fourier transform on the continuous smooth curve gives a transform result in complex form. For example, if the statistical data cover 400 days, the frequency data of a word correspond to 400 discrete points, and the transform finally yields 400 complex numbers.
Step S503, calculating the amplitude of the transformation result, and determining the feature vector of the vocabulary according to the amplitude of the transformation result.
Because the 400 complex numbers are the components of the Fourier transform, the amplitude and the phase of each component can be calculated from the real part and the imaginary part of the corresponding complex number. The amplitude is taken as the feature of each component, and the feature vector of the word is finally determined from the amplitudes.
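The transform and amplitude steps can be sketched with the standard library alone. The code below uses a direct discrete Fourier transform, which is an assumption about how the transform would be evaluated on sampled data; it is quadratic in the number of points, which is acceptable for a few hundred daily values, and an FFT routine such as `numpy.fft.fft` computes the same components faster. The function names are illustrative.

```python
import cmath


def dft(samples):
    """Direct discrete Fourier transform: n samples give n complex components."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(samples))
        for k in range(n)
    ]


def amplitude_features(samples):
    """Feature vector made of the amplitude (modulus) of each component."""
    return [abs(component) for component in dft(samples)]


def phase_features(samples):
    """Phase (argument) of each component, derived from its real and imaginary parts."""
    return [cmath.phase(component) for component in dft(samples)]


# Hypothetical four-point curve; a 400-day curve would give 400 amplitudes.
amplitudes = amplitude_features([1, 0, -1, 0])
```

The amplitude of a component is `sqrt(re**2 + im**2)`, which is what `abs` returns for a complex number, so every word analyzed over the same number of days receives a feature vector of the same dimension.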
In some embodiments, determining the result of calculating the relevancy of the word string to be processed according to the feature vector of the vocabulary, as shown in fig. 6, includes:
Step S601, acquiring the feature vector of a word.
In a specific implementation, the amplitudes obtained from the Fourier transform can be used as the feature vector of the word.
Step S602, clustering the feature vectors of the words to determine the clustering result of the words.
The feature vectors can be clustered with a model such as k-means; alternatively, the amplitudes in the feature vectors of different words can be compared, and words whose similarity is greater than a preset threshold are taken as the clustering result.
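The threshold-based alternative can be sketched as follows. The text does not say which similarity measure is used, so cosine similarity is an assumption here, as are the greedy grouping rule (a word joins the first group whose first member is similar enough), the 0.9 threshold and the sample vectors. It is not the k-means option, which would require a separate clustering model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_by_similarity(features, threshold=0.9):
    """Greedily group words whose amplitude vectors exceed the similarity threshold.

    features: dict mapping each word to its amplitude feature vector.
    """
    groups = []
    for word, vector in features.items():
        for group in groups:
            representative = features[group[0]]
            if cosine_similarity(representative, vector) > threshold:
                group.append(word)
                break
        else:
            groups.append([word])  # no similar group found: start a new one
    return groups


# Hypothetical amplitude vectors for three words.
features = {
    "rain": [0, 2, 0, 2],
    "flood": [0, 4, 0, 4.2],
    "election": [5, 0, 1, 0],
}
groups = group_by_similarity(features)
```

Words that end up in the same group are treated as related, which corresponds to using the clustering result as the relevance calculation result in step S603.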
Step S603, determining a result of calculating the relevancy of the word string to be processed according to the clustering result of the vocabulary.
In some embodiments, the step S102 of counting the types and the number of the words in the word string to be processed within a preset time period may also be implemented with an artificial intelligence model and, as shown in fig. 7, includes:
Step S701, acquiring the word string to be processed within a preset time period.
Step S702, inputting the word string to be processed into a preset vocabulary segmentation model, and obtaining all the words contained in the word string to be processed.
The vocabulary segmentation model in this step is an artificial intelligence model that directly outputs the word segmentation result once the word string to be processed is input. The model can be obtained by training an existing convolutional neural network, and is not described in detail here.
Step S703, counting all the words in the word string to be processed, and determining the type and number of the words in the word string to be processed.
The calculation of word string relevancy is described in detail below with reference to the flowchart of another word string relevancy calculation method shown in fig. 8. Specifically, the method includes:
Step S801, inputting the word string to be processed.
This step is an initialization step in which the word string to be processed is obtained from the target text. For example, the target text is text data collected by a web crawler over 400 consecutive days and stored by date, one day per unit.
Step S802, counting the frequency of the vocabulary.
The word strings to be processed are aggregated by day, and all word strings of the same day are segmented using a text word segmentation technique to obtain the frequency with which each word appears on that day. Because the data covers 400 days, each word has one frequency value per day; if a word does not appear on a given day, its frequency for that day is 0.
Step S803, determining whether the vocabulary is a stop word.
If the word is a stop word, processing of that word ends and the next word is processed; if the word is not a stop word, step S804 is performed.
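Steps S802 and S803 can be sketched together as follows. The input layout (one list of already segmented words per day, in date order), the name `daily_frequency_series`, and the contents of `STOP_WORDS` are assumptions made for illustration.

```python
from collections import Counter

STOP_WORDS = {"the", "of"}  # hypothetical stop-word list


def daily_frequency_series(words_by_day, stop_words=STOP_WORDS):
    """Return {word: [frequency on day 0, frequency on day 1, ...]}.

    Stop words are skipped; a word absent on a given day gets frequency 0.
    """
    daily_counts = [
        Counter(word for word in words if word not in stop_words)
        for words in words_by_day
    ]
    vocabulary = sorted(set().union(*daily_counts))
    # Counter returns 0 for a missing key, which gives the zero entries.
    return {word: [counts[word] for counts in daily_counts] for word in vocabulary}


series = daily_frequency_series([["the", "car", "car", "road"], ["road"], []])
```

For 400 days of data, every list in the result would have 400 entries, so all words share the same time axis.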
Step S804, determining the vocabulary statistical curve of the vocabulary.
The horizontal axis represents time, in days; the vertical axis represents the frequency of the vocabulary, in number of occurrences. The 400 daily frequencies of the word are plotted in a rectangular coordinate system as 400 discrete points, and connecting these points gives the vocabulary statistical curve of the word.
Step S805, smoothing the vocabulary statistical curve.
The 400 discrete frequency points in the rectangular coordinate system are smoothed into a continuous curve using a smoothing technique, so that the waveform can be further analyzed with waveform processing tools.
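The text does not specify which smoothing technique is used in step S805. As one possible choice, a centred moving average can be sketched as follows; the function name and the window size are assumptions for illustration.

```python
def moving_average(series, window=3):
    """Centred moving average; the window shrinks at the edges so the
    output has the same length as the input."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


smoothed = moving_average([0, 3, 0, 6, 0])
```

Keeping the output length equal to the input length means a 400-day series still yields 400 samples for the subsequent Fourier transform.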
Step S806, performing a Fourier transform on the smoothed vocabulary statistical curve to obtain the feature vector.
The vocabulary waveform is processed with a Fourier transform to obtain 400 complex numbers of the form a + bi, that is, a 400-dimensional complex vector. These 400 complex numbers are the Fourier components, where a is the real part and b is the imaginary part; the amplitude and phase of each component can be calculated from its real and imaginary parts. In a specific implementation, the amplitude can be selected as the feature of each component, which finally gives the feature vector of the vocabulary.
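Step S806 can be sketched as follows, assuming the smoothed curve is sampled once per day and a discrete Fourier transform is applied to those samples. For a component a + bi, the amplitude is sqrt(a^2 + b^2) and the phase is atan2(b, a). The naive transform below is written out only to keep the example self-contained; the function names are illustrative.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform; returns n complex components."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def amplitude_features(samples):
    """Amplitude sqrt(a^2 + b^2) of every Fourier component a + bi."""
    return [abs(component) for component in dft(samples)]


# Four samples of a simple oscillation; amplitudes are approximately [0, 2, 0, 2].
features = amplitude_features([1.0, 0.0, -1.0, 0.0])
```

A 400-sample series gives 400 amplitudes, matching the 400-dimensional feature vector described above. In practice a fast Fourier transform routine such as `numpy.fft.fft` would normally replace the naive loop.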
Step S807, performing a clustering operation on the feature vectors.
The amplitude features extracted from the Fourier components are used as the feature vectors, and the clustering operation groups words with similar feature vectors into the same class of related words. Example results are as follows:
"true/a, flap/v, Brazil/ns, man/n, ride/v, motorcycle/n, quilt/p, car/n, bump/v, fall/v, overpass/n/d, odds/n, survival/v".
"Driving/v, Motor vehicle/n, violation/v, road/n, traffic/n, Signal/n, traffic/v, (/ w, penalty/v, 150/m, Yuan/q,/w, Note/v, 6/m, min/q),/w".
In the results for the related vocabulary above, /a represents an adjective; /n represents a noun; /v represents a verb; /ns represents a country; /p represents a relational term; /d represents a mood word; /u represents the word "of"; /w represents a symbol; and /q represents a unit. According to these results, the word strings can be conveniently associated.
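The "word/tag" notation used in the results can be read programmatically, as in the following sketch. The function name `parse_tagged` and the assumption that tokens are separated by a comma and a space are illustrative only.

```python
def parse_tagged(result):
    """Split a 'word/tag, word/tag' string into (word, tag) pairs."""
    pairs = []
    for token in result.split(", "):
        word, sep, tag = token.rpartition("/")
        if sep and word:  # skip tokens that lack a word or a tag separator
            pairs.append((word, tag))
    return pairs


pairs = parse_tagged("Brazil/ns, man/n, ride/v, motorcycle/n")
nouns = [word for word, tag in pairs if tag == "n"]
```

Filtering by tag in this way makes it possible, for example, to keep only the nouns of a cluster when associating word strings.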
According to the word string relevance calculation method provided in this embodiment, the target words in the target text are screened, the occurrence frequency of each target word at each time point is counted, and feature vectors over the same time points and with the same dimension are obtained. The relevance calculation result of the word string to be processed is then determined from the feature vectors of the vocabulary, which improves the relevance among words and the accuracy of relevance results for related words across time and space.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a system for calculating string relevancy, a schematic structural diagram of which is shown in fig. 9, and the system includes:
a to-be-processed word string obtaining module 910, configured to obtain a to-be-processed word string from a target text;
a word string statistics module 920, configured to count the types and the number of words in a word string to be processed in a preset time period;
a vocabulary statistics module 930, configured to determine a vocabulary statistical curve under the category of the vocabulary according to the category and the number of the vocabulary in the word string to be processed; wherein the abscissa of the vocabulary statistical curve is time, and the ordinate of the vocabulary statistical curve is the number of vocabularies in the word string to be processed;
and a relevance calculating module 940, configured to obtain the feature vector of the vocabulary in the vocabulary statistical curve, and determine the relevance calculation result of the word string to be processed according to the feature vector of the vocabulary.
The system for calculating string relevancy according to the embodiment of the present invention has the same technical features as the method for calculating string relevancy according to the above embodiment, so that the same technical problems can be solved, and the same technical effects can be achieved. For the sake of brevity, where not mentioned in the examples section, reference may be made to the corresponding matter in the preceding method examples.
The embodiment also provides an electronic device, a schematic structural diagram of which is shown in fig. 10, and the electronic device includes a processor 101 and a memory 102; the memory 102 is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method for calculating the word string relevance.
The electronic device shown in fig. 10 further includes a bus 103 and a communication interface 104, and the processor 101, the communication interface 104, and the memory 102 are connected through the bus 103.
The Memory 102 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory. Bus 103 may be an ISA bus, PCI bus, EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 10, but this does not indicate only one bus or one type of bus.
The communication interface 104 is configured to connect with at least one user terminal and other network units through a network interface, and to send the packaged IPv4 message to the user terminal through the network interface.
The processor 101 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor 101 or by instructions in the form of software. The processor 101 may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present disclosure. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present disclosure may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 102; the processor 101 reads the information in the memory 102 and completes the steps of the method of the foregoing embodiment in combination with its hardware.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the method of the foregoing embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention or a part thereof, which essentially contributes to the prior art, can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, anyone skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and should be construed as being included in its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for calculating word string relevancy, the method comprising:
acquiring a word string to be processed from a target text;
counting the types and the number of vocabularies in the word string to be processed within a preset time period;
determining a vocabulary statistical curve under the category of the vocabulary according to the category and the number of the vocabulary in the word string to be processed; wherein the abscissa of the vocabulary statistical curve is time; the ordinate of the vocabulary statistical curve is the number of the vocabularies in the word string to be processed;
and acquiring the characteristic vector of the vocabulary in the vocabulary statistical curve, and determining the correlation calculation result of the word string to be processed according to the characteristic vector of the vocabulary.
2. The method for calculating string relevancy according to claim 1, wherein the step of obtaining the string to be processed from the target text includes:
acquiring a target text; the target text is obtained from a text database, a web crawler tool result and a log file;
reading the target text and traversing all word strings in the target text to obtain word strings corresponding to all words in the target text;
and determining the word string to be processed according to the word strings corresponding to all the words in the target text.
3. The method for calculating word string relevancy according to claim 2, wherein the step of reading the target text and traversing all word strings in the target text to obtain word strings corresponding to all vocabularies in the target text comprises:
acquiring a shielding vocabulary list; the shielding vocabulary list is used for shielding the vocabulary of the target text;
traversing all word strings in the target text, shielding the word strings belonging to the shielding vocabulary list, and determining the word strings corresponding to all the words in the target text.
4. The method for calculating string relevancy according to claim 1, wherein the step of determining a vocabulary statistical curve under the category of the vocabulary according to the category and the number of the vocabulary in the word string to be processed comprises:
acquiring the type and the number of the words in the word strings to be processed within the preset time period;
and counting the vocabulary statistical curve in the preset time period according to the category of the vocabulary.
5. The method for calculating string relevancy according to claim 1, wherein the obtaining of the feature vector of the vocabulary in the vocabulary statistical curve comprises:
smoothing the vocabulary statistical curve to obtain a continuous smooth curve of the vocabulary;
performing Fourier transformation on the continuous smooth curve of the vocabulary to obtain a transformation result of the continuous smooth curve; wherein the transformation result is in a complex form;
and calculating the amplitude of the transformation result, and determining the feature vector of the vocabulary according to the amplitude of the transformation result.
6. The method for calculating string relevancy according to claim 1, wherein determining relevancy calculation results of strings to be processed according to the feature vectors of the vocabularies comprises:
acquiring a feature vector of the vocabulary;
clustering the characteristic vectors of the vocabularies to determine clustering results of the vocabularies;
and determining the correlation calculation result of the word string to be processed according to the clustering result of the vocabulary.
7. The method for calculating string relevancy according to claim 1, wherein the step of counting the types and the number of vocabularies in the word string to be processed within a preset time period comprises:
acquiring the word strings to be processed in a preset time period;
inputting the word string to be processed into a preset vocabulary segmentation model to obtain all vocabularies contained in the word string to be processed;
and counting all the words in the word string to be processed, and determining the types and the number of the words in the word string to be processed.
8. A system for calculating word string relevancy, the system comprising:
the word string to be processed acquiring module is used for acquiring a word string to be processed from the target text;
the word string counting module is used for counting the types and the number of the words in the word string to be processed within a preset time period;
the vocabulary statistics module is used for determining a vocabulary statistics curve under the category of the vocabulary according to the category and the number of the vocabulary in the word string to be processed; wherein the abscissa of the vocabulary statistical curve is time; the ordinate of the vocabulary statistical curve is the number of the vocabularies in the word strings to be processed;
and the relevance calculating module is used for acquiring the characteristic vector of the vocabulary in the vocabulary statistical curve and determining the relevance calculating result of the word string to be processed according to the characteristic vector of the vocabulary.
9. An electronic device, comprising: a processor and a storage device; the storage device has a computer program stored thereon, which when executed by the processor implements the steps of the method for calculating string relevancy according to any one of claims 1 to 7.
10. A computer-readable storage medium, having a computer program stored thereon, where the computer program is executed by a processor to implement the steps of the method for calculating string relevancy according to any one of the above claims 1 to 7.
CN202110917193.7A 2021-08-11 2021-08-11 Word string relevance calculation method and system and electronic equipment Pending CN113515954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110917193.7A CN113515954A (en) 2021-08-11 2021-08-11 Word string relevance calculation method and system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110917193.7A CN113515954A (en) 2021-08-11 2021-08-11 Word string relevance calculation method and system and electronic equipment

Publications (1)

Publication Number Publication Date
CN113515954A true CN113515954A (en) 2021-10-19

Family

ID=78068118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110917193.7A Pending CN113515954A (en) 2021-08-11 2021-08-11 Word string relevance calculation method and system and electronic equipment

Country Status (1)

Country Link
CN (1) CN113515954A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776751A (en) * 2016-11-22 2017-05-31 上海智臻智能网络科技股份有限公司 The clustering method and clustering apparatus of a kind of data
CN109635299A (en) * 2018-12-13 2019-04-16 北京锐安科技有限公司 Vocabulary correlation determines method, apparatus, equipment and computer readable storage medium
KR20210017632A (en) * 2019-08-09 2021-02-17 주식회사 한화 Apparatus and method for performing debugging text and keyword mining

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李强: "Vocabulary relevance calculation based on resonance theory", China Excellent Master's Theses Full-text Database, pages 138 - 1428 *

Similar Documents

Publication Publication Date Title
CN108319668B (en) Method and equipment for generating text abstract
CN109101620B (en) Similarity calculation method, clustering method, device, storage medium and electronic equipment
CN109299280B (en) Short text clustering analysis method and device and terminal equipment
WO2022095374A1 (en) Keyword extraction method and apparatus, and terminal device and storage medium
CN108108426B (en) Understanding method and device for natural language question and electronic equipment
US11372942B2 (en) Method, apparatus, computer device and storage medium for verifying community question answer data
CN111460148A (en) Text classification method and device, terminal equipment and storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN111460153A (en) Hot topic extraction method and device, terminal device and storage medium
CN109726391B (en) Method, device and terminal for emotion classification of text
CN111177375B (en) Electronic document classification method and device
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
CN114970525B (en) Text co-event recognition method, device and readable storage medium
CN112329460A (en) Text topic clustering method, device, equipment and storage medium
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN108804550B (en) Query term expansion method and device and electronic equipment
CN115496070A (en) Parallel corpus data processing method, device, equipment and medium
CN112417154B (en) Method and device for determining similarity of documents
CN113515954A (en) Word string relevance calculation method and system and electronic equipment
CN113836918A (en) Document searching method and device, computer equipment and computer readable storage medium
CN113934842A (en) Text clustering method and device and readable storage medium
CN111611379A (en) Text information classification method, device, equipment and readable storage medium
CN111401034A (en) Text semantic analysis method, semantic analysis device and terminal
CN115878759B (en) Text searching method, device and storage medium
CN113283229B (en) Text similarity calculation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination