CN113821601A - Text comparison method, device, equipment and medium - Google Patents

Text comparison method, device, equipment and medium

Info

Publication number
CN113821601A
Authority
CN
China
Prior art keywords
text
target
compared
sentence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111131481.6A
Other languages
Chinese (zh)
Inventor
陈家栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongjing Huizhong Technology Co ltd
Original Assignee
Beijing Zhongjing Huizhong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongjing Huizhong Technology Co ltd
Priority to CN202111131481.6A
Publication of CN113821601A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text comparison method, a text comparison apparatus, a computer device and a medium. The text comparison method may comprise the following steps: acquiring a target vector corresponding to a target text; for each text to be compared among a plurality of texts to be compared, acquiring a vector to be compared corresponding to that text; preliminarily screening the plurality of texts to be compared, based on the target vector and the vector to be compared corresponding to each text to be compared, to obtain a plurality of preliminarily screened texts according to their similarity to the target text; acquiring text features of the target text and text features of each of the plurality of preliminarily screened texts; and screening, from the plurality of preliminarily screened texts and based on the corresponding text features, at least one conflicting text that semantically conflicts with the target text.

Description

Text comparison method, device, equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular, to the field of natural language processing, and more particularly, to a text comparison method, apparatus, computer device, and medium.
Background
In the technical field of natural language processing, judging the similarity between two texts has very wide application. In some applications there is also a need to determine whether two or more texts conflict with one another, for example, whether a plurality of texts containing data conflict, or whether regulations within the same field conflict. When the number of texts to be compared is large, improving the efficiency of comparison between texts becomes very important.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides a text comparison method, apparatus, computer device, computer readable storage medium and computer program product.
According to an aspect of the present disclosure, there is provided a text comparison method, including: acquiring a target vector corresponding to a target text; for each text to be compared among a plurality of texts to be compared, acquiring a vector to be compared corresponding to that text; preliminarily screening the plurality of texts to be compared, based on the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared, to obtain a plurality of preliminarily screened texts whose similarity to the target text is greater than a first preset threshold; acquiring text features of the target text and text features of each of the plurality of preliminarily screened texts; and screening, from the plurality of preliminarily screened texts and based on the corresponding text features, at least one conflicting text that semantically conflicts with the target text.
According to another aspect of the present disclosure, there is provided a text comparison apparatus, the apparatus including: the first acquisition module is configured to acquire a target vector corresponding to a target text; the second acquisition module is configured to acquire a to-be-compared vector corresponding to each to-be-compared text in the plurality of to-be-compared texts; the first screening module is configured to preliminarily screen a plurality of texts to be compared to obtain a plurality of preliminarily screened texts with similarity greater than a first preset threshold value with the target text based on a target vector corresponding to the target text and a vector to be compared corresponding to each text to be compared; a third obtaining module configured to obtain a text feature of the target text and a text feature of each of the plurality of preliminary screening texts; and a second screening module configured to screen at least one conflicting text that semantically conflicts with the target text from the plurality of preliminary screened texts based on the corresponding text features.
According to another aspect of the present disclosure, there is provided a computer device including: a memory, a processor and a computer program stored on the memory, wherein the processor is configured to execute the computer program to implement the steps of the above method.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described method.
According to another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the above-described method.
According to one or more embodiments of the disclosure, texts that may conflict with the target text are first screened out based on machine learning, and conflicting texts are then screened out from among them, so that the workload of manual review can be reduced and the efficiency and accuracy of text comparison are improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of those embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a flow diagram of a text comparison method according to an embodiment of the present disclosure;
FIG. 2 shows a flowchart of a method for obtaining a target vector corresponding to a target text according to an embodiment of the present disclosure;
FIG. 3 shows a flowchart of a method for obtaining a vector to be compared corresponding to a text to be compared according to an embodiment of the present disclosure;
FIG. 4 shows a block diagram of a text comparison apparatus according to an embodiment of the present disclosure; and
FIG. 5 shows a block diagram of an exemplary computer device to which exemplary embodiments of the present disclosure can be applied.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in other cases they may refer to different instances, depending on the context of the description.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In related text comparison technology, especially when comparing texts in professional fields such as legal documents, a plurality of texts are read one by one manually to determine whether conflicts exist among them. Manual comparison is inefficient, and conflicts may also be overlooked, which affects the accuracy of the comparison result.
To solve these problems, texts that may conflict with the target text are first screened out based on machine learning, and conflicting texts are then screened out from among them, so that the workload of manual review can be reduced and the efficiency and accuracy of text comparison are improved.
The text comparison method of the present disclosure will be further described below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a text comparison method according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the text comparison method 100 includes: step S101, acquiring a target vector corresponding to a target text; step S102, for each text to be compared among a plurality of texts to be compared, acquiring a vector to be compared corresponding to that text; step S103, preliminarily screening the plurality of texts to be compared, based on the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared, to obtain a plurality of preliminarily screened texts whose similarity to the target text is greater than a first preset threshold; step S104, acquiring text features of the target text and text features of each of the plurality of preliminarily screened texts; and step S105, screening, from the plurality of preliminarily screened texts and based on the corresponding text features, at least one conflicting text that semantically conflicts with the target text.
The inventors have found that texts which conflict with each other are related in content, and that texts whose content is entirely unrelated are unlikely to conflict semantically. Therefore, the technical scheme in the embodiments of the disclosure can exclude texts irrelevant to the target text based on a comparison of text similarity, and obtain a plurality of preliminarily screened texts that may conflict with the target text. Then, based on a comparison of text features, at least one conflicting text that semantically conflicts with the target text is further screened out from the plurality of preliminarily screened texts. Screening out the conflicting texts in two stages using machine learning can improve the efficiency and accuracy of text comparison and reduce the workload of manual review.
According to some embodiments, the method further comprises: constructing a preset word vector library, wherein the word vector library comprises a plurality of candidate words and a word vector corresponding to each candidate word. The target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared may then be obtained based on the word vector library. Because the distribution of the plurality of candidate words in the vector space is determined in advance, the vector of a given text can be determined quickly from the word vector library in order to calculate relevance.
According to some embodiments, the word vector model may be pre-trained using a large amount of corpora, and the pre-trained word vector model may be fine-tuned using a small amount of sample text of the same type or field as the target text and the text to be compared, thereby completing model training. Therefore, a word vector library suitable for a certain type of text or a certain field of text can be constructed by utilizing the word vector model obtained by training.
For example, when the target text and the texts to be compared are all legal documents in a certain field, a small number of sample texts from the legal documents in that field can be used to fine-tune the pre-trained word vector model, and the preset word vector library is constructed using the trained model. When new legal documents appear in the field, a small number of samples from them can be used periodically to fine-tune the word vector model, and the trained model is then used to update the preset word vector library for the newly added documents, which can improve the stability and accuracy of the text comparison method.
It is understood that the text compared by the present disclosure is not limited to the legal documents, and the text comparison method provided by the present disclosure can also be used for comparing other types of text, such as comparing whether conflicts exist between financial statements containing data, whether conflicts exist between textbooks of the same subject, and the like. The present disclosure is not limited as to the type of text that is compared.
Fig. 2 shows a flowchart of a method for obtaining a target vector corresponding to a target text according to an embodiment of the present disclosure.
According to some embodiments, as shown in fig. 2, in the case of constructing a preset word vector library, the step S101 of obtaining a target vector corresponding to a target text may include: step S201, segmenting the target text to obtain at least one sentence included in the target text; step S202, obtaining at least one target word included in each sentence in at least one sentence included in the target text; step S203, obtaining a word vector corresponding to at least one target word included in each statement from the word vector library; step S204, determining a target sentence vector corresponding to each statement based on a word vector corresponding to each at least one target word included in each statement; and step S205, determining a target vector corresponding to the target text based on a target sentence vector corresponding to each of at least one sentence included in the target text.
Therefore, at least one target word in each sentence can be obtained, a target sentence vector is determined based on the word vector library, and a target vector corresponding to a target text is determined based on the target sentence vector for comparison of similarity of subsequent texts.
For example, each sentence may be segmented into a plurality of words, and each of the plurality of words may be matched against the plurality of candidate words in the word vector library. A word may be determined to be a target word in response to determining that the word vector library includes a candidate word matching that word. Further, the word vector of the candidate word matching the target word may be determined as the word vector of the target word. The word vectors of the at least one target word may be accumulated and/or concatenated to obtain the target sentence vector corresponding to the sentence.
In one example, in step S201, the target text may be segmented based on periods in the target text to obtain at least one sentence included in the target text. In another example, the target text may be segmented based on paragraphs. The specific segmentation mode may be determined according to an application scenario and an actual requirement, which is not limited in this disclosure.
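Steps S201 to S205 can be illustrated with a minimal sketch. It is not the disclosed implementation: the toy library `WORD_VECTORS`, the function names, the regular-expression word segmentation (which only suits space-delimited text; Chinese text would need a dedicated word segmenter), and the choice to sum sentence vectors into a text vector are all assumptions made for illustration. The disclosure itself only states that word vectors may be accumulated and/or concatenated.

```python
import re

# Hypothetical toy word vector library: candidate word -> word vector.
WORD_VECTORS = {
    "tenant": [1.0, 0.0, 0.0],
    "rent": [0.0, 1.0, 0.0],
    "deposit": [0.0, 0.0, 1.0],
}


def split_sentences(text):
    """S201: segment the text on periods (one of the options described).

    Note: this naive split would also break decimal numbers such as "1.5".
    """
    return [s.strip() for s in re.split(r"[.。]", text) if s.strip()]


def sentence_vector(sentence, library):
    """S202-S204: keep words found in the library and accumulate their vectors."""
    dim = len(next(iter(library.values())))
    total = [0.0] * dim
    # Assumed word segmentation: runs of word characters in lower case.
    for word in re.findall(r"\w+", sentence.lower()):
        vec = library.get(word)
        if vec is None:
            continue  # no matching candidate word, so not a target word
        total = [t + v for t, v in zip(total, vec)]
    return total


def text_vector(text, library):
    """S205: combine sentence vectors; summation is an assumed choice."""
    dim = len(next(iter(library.values())))
    total = [0.0] * dim
    for sentence in split_sentences(text):
        vec = sentence_vector(sentence, library)
        total = [t + v for t, v in zip(total, vec)]
    return total
```

The same functions could be applied unchanged to each text to be compared, mirroring the statement that steps S301 to S305 parallel steps S201 to S205.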
Fig. 3 shows a flowchart of a method for obtaining a to-be-compared vector corresponding to a to-be-compared text according to an embodiment of the present disclosure.
According to some embodiments, as shown in fig. 3, in the case of constructing a preset word vector library, the step S102 of obtaining a to-be-compared vector corresponding to a to-be-compared text includes: step S301, segmenting the text to be compared to obtain at least one sentence included in the text to be compared; step S302, acquiring at least one word to be compared, which is included in each sentence in at least one sentence included in the text to be compared; step S303, obtaining a word vector to be compared corresponding to at least one word to be compared included in each statement from the word vector library; step S304, determining a sentence vector to be compared corresponding to each sentence based on the respective corresponding word vector of at least one word to be compared included in each sentence; and step S305, determining a to-be-compared vector corresponding to the to-be-compared text based on the to-be-compared sentence vector corresponding to each of the at least one sentence included in the to-be-compared text.
Therefore, the vectors to be compared are obtained by using a method similar to the method for obtaining the target vectors, and are used for comparing the similarity between the target text and the text to be compared.
The process implemented in steps S301 to S305 is similar to that implemented in steps S201 to S205, and is not described again here.
According to one embodiment, the preliminary screening to obtain the plurality of preliminary screening texts may be implemented by, but is not limited to, the following steps: calculating cosine similarity between the target text and the text to be compared based on the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared; and determining a plurality of preliminary screening texts based on the calculated cosine similarity. Therefore, the similarity between the corresponding texts is determined based on the cosine similarity between the vectors corresponding to the texts, and is used for screening out a plurality of preliminary screening texts of which the similarity with the target text is greater than a first preset threshold value.
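The preliminary screening by cosine similarity can be sketched as follows. The function names, the dictionary layout of the candidates, and the convention of returning 0.0 for an all-zero vector are assumptions for illustration; the disclosure only specifies that texts whose similarity exceeds a first preset threshold are retained.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def preliminary_screen(target_vec, candidates, first_threshold):
    """Return the ids of candidate texts whose similarity exceeds the threshold.

    `candidates` maps a (hypothetical) text id to its vector to be compared.
    """
    return [
        text_id
        for text_id, vec in candidates.items()
        if cosine_similarity(target_vec, vec) > first_threshold
    ]
```

For instance, with a target vector `[1, 0]` and a threshold of 0.5, a candidate `[1, 1]` (similarity about 0.71) is kept while an orthogonal candidate `[0, 1]` (similarity 0) is excluded.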
After a plurality of preliminary screening texts which may conflict with the target text are obtained by screening, step S104 is executed to obtain text features of the target text and text features of each of the plurality of preliminary screening texts.
According to some embodiments, the text feature of the target text is determined based on the respective text features of at least one sentence included in the target text, and the text feature of each preliminarily screened text is determined based on the respective text features of at least one sentence included in that preliminarily screened text. It can be understood that text features of sentences are easier to quantify, and can be quantified more accurately, than text features of whole texts. Determining the text features of a text from the text features of its sentences therefore reduces the difficulty of quantifying the features and describes the text more accurately, which further improves the accuracy of text comparison.
According to some embodiments, the text features of a sentence include at least one of: the subject words included in the sentence, the grammatical structure of the sentence, the numeric range and numeric magnitude in the sentence, and the length of the sentence, wherein the subject words are defined in a preset dictionary. The sentence text features listed in the present disclosure perform well in identifying conflicting sentences; other sentence text features may also be set according to the types of texts to be compared, which is not limited by the present disclosure.
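A minimal sketch of extracting such sentence features is given below. The dictionary `SUBJECT_DICTIONARY`, the negation word list, the `SentenceFeatures` record, and the regular expressions are hypothetical stand-ins: the disclosure does not specify how grammatical structure or numeric values are detected, so a simple positive/negative flag and a tuple of numbers are used here for illustration only.

```python
import re
from dataclasses import dataclass

# Hypothetical preset dictionary defining the subject words.
SUBJECT_DICTIONARY = {"tenant", "landlord", "deposit", "rent"}
# Hypothetical negation markers used as a crude grammatical-structure signal.
NEGATION_WORDS = {"not", "no", "never"}


@dataclass(frozen=True)
class SentenceFeatures:
    subject_words: frozenset  # subject words found in the preset dictionary
    is_negative: bool         # crude proxy for grammatical structure
    numbers: tuple            # numeric values appearing in the sentence
    length: int               # sentence length in words


def extract_features(sentence):
    """Build the feature record for one sentence."""
    words = re.findall(r"[a-z]+", sentence.lower())
    numbers = tuple(float(n) for n in re.findall(r"\d+(?:\.\d+)?", sentence))
    return SentenceFeatures(
        subject_words=frozenset(w for w in words if w in SUBJECT_DICTIONARY),
        is_negative=any(w in NEGATION_WORDS for w in words),
        numbers=numbers,
        length=len(words),
    )
```

The features of a whole text could then be represented simply as the list of feature records of its sentences, in line with the sentence-based determination described above.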
According to some embodiments, the step S105 of screening, based on the corresponding text features, at least one conflicting text that semantically conflicts with the target text from the plurality of preliminarily screened texts includes: for each of the plurality of preliminarily screened texts, determining, based on the text features of the preliminarily screened text and the text features of the target text, the conflicting sentences that are contained in the preliminarily screened text and semantically conflict with the target text; and in response to determining that the number of such conflicting sentences is greater than a second preset threshold, determining the preliminarily screened text to be a conflicting text. Conflicting texts that semantically conflict with the target text are thus screened out by counting conflicting sentences.
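The counting step can be sketched as below, with the sentence-level conflict test passed in as a function. The function names and the data layout are assumptions, and `toy_conflict` is a deliberately simplistic stand-in predicate (two sentences that differ only by the word "not"), not the comparison rules of the disclosure.

```python
def count_conflict_sentences(target_sentences, screened_sentences, sentences_conflict):
    """Count sentences of a screened text that conflict with any target sentence."""
    return sum(
        1
        for s in screened_sentences
        if any(sentences_conflict(t, s) for t in target_sentences)
    )


def select_conflict_texts(target_sentences, screened_texts, sentences_conflict,
                          second_threshold):
    """Keep the screened texts whose conflict-sentence count exceeds the threshold.

    `screened_texts` maps a (hypothetical) text id to its list of sentences.
    """
    return [
        text_id
        for text_id, sentences in screened_texts.items()
        if count_conflict_sentences(target_sentences, sentences, sentences_conflict)
        > second_threshold
    ]


def toy_conflict(a, b):
    """Stand-in predicate: same words, but exactly one sentence contains 'not'."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return ("not" in wa) != ("not" in wb) and (wa - {"not"}) == (wb - {"not"})
```

Passing the predicate as a parameter keeps the counting logic independent of whichever feature comparison rules are chosen.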
According to one embodiment, the conflicting sentences that are contained in a preliminarily screened text and semantically conflict with the target text can be determined based on a preset comparison rule for each text feature of a sentence. For example, when the degree of coincidence of the subject words included in two sentences falls within a preset interval, the two sentences can be determined to be conflicting sentences with respect to the subject-word feature; when one of the two sentences is an affirmative sentence and the other is a negative sentence, the two sentences can be determined to be conflicting sentences with respect to the grammatical-structure feature. The conflicting sentences may also be determined by combining the comparison rules of at least two text features. For example, when the absolute value of the difference between the lengths (e.g., the number of included words) of the two sentences is not greater than a preset threshold, it is further determined whether the degree of coincidence of the subject words included in the two sentences falls within the preset interval; when the absolute value of the difference between the lengths is greater than the preset threshold, the two sentences are directly determined not to be conflicting sentences.
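One possible way of combining these comparison rules is sketched below. The Jaccard measure used for the "degree of coincidence", the default threshold and interval values, the dictionary layout of the features, and the decision to require all three rules at once are assumptions for illustration; the disclosure leaves the concrete rules and their combination open.

```python
def subject_overlap(a, b):
    """Jaccard overlap of two subject-word sets (an assumed coincidence measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sentences_conflict(feat_a, feat_b, max_length_diff=5, overlap_interval=(0.5, 1.0)):
    """Apply the length gate, then the subject-word rule, then the polarity rule.

    Each feature argument is a dict with the hypothetical keys
    'subject_words' (set), 'is_negative' (bool) and 'length' (int).
    """
    # Length gate: very different lengths are directly treated as non-conflicting.
    if abs(feat_a["length"] - feat_b["length"]) > max_length_diff:
        return False
    # Subject-word rule: the overlap must fall inside the preset interval.
    low, high = overlap_interval
    overlap = subject_overlap(feat_a["subject_words"], feat_b["subject_words"])
    if not (low <= overlap <= high):
        return False
    # Polarity rule: one affirmative sentence and one negative sentence.
    return feat_a["is_negative"] != feat_b["is_negative"]
```

Placing the cheap length check first means the set comparison is skipped for sentence pairs that the rules would reject anyway.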
According to another aspect of the present disclosure, a text comparison apparatus is provided. As shown in fig. 4, the text comparison apparatus 400 includes: a first obtaining module 401 configured to obtain a target vector corresponding to a target text; a second obtaining module 402, configured to obtain, for each text to be compared in a plurality of texts to be compared, a vector to be compared corresponding to the text to be compared; a first screening module 403, configured to preliminarily screen, based on a target vector corresponding to the target text and a to-be-compared vector corresponding to each to-be-compared text, a plurality of preliminarily screened texts with similarity greater than a first preset threshold with the target text from the plurality of to-be-compared texts; a third obtaining module 404, configured to obtain a text feature of the target text and a text feature of each of the plurality of preliminary screening texts; and a second filtering module 405 configured to filter at least one conflicting text that semantically conflicts with the target text from the plurality of preliminary screened texts based on the corresponding text features.
Therefore, the first screening module 403 excludes texts irrelevant to the target text based on a comparison of text similarity, and obtains a plurality of preliminarily screened texts that may conflict with the target text. The second screening module 405 then further screens out, from the plurality of preliminarily screened texts and based on a comparison of text features, at least one conflicting text that semantically conflicts with the target text. Screening out the conflicting texts in two stages using machine learning can improve the efficiency and accuracy of text comparison and reduce the workload of manual review.
The operations of modules 401 to 405 of the text comparison apparatus 400 are similar to those of steps S101 to S105 described above, and are not repeated here.
According to some embodiments, the apparatus further comprises: a building module configured to construct a preset word vector library, wherein the word vector library comprises a plurality of candidate words and a word vector corresponding to each candidate word. Thus, the first obtaining module 401 and the second obtaining module 402 can obtain, based on the word vector library, the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared, respectively. Because the distribution of the plurality of candidate words in the vector space is determined in advance, the vector of a given text can be determined quickly from the word vector library in order to calculate relevance.
According to some embodiments, the building module may pre-train the word vector model using a large corpus, and fine-tune the pre-trained word vector model using a small number of sample texts of the same type or the same field as the target text and the texts to be compared, thereby completing model training. A word vector library suitable for a certain type of text or a certain field of text can thus be constructed using the trained word vector model. For example, when the target text and the texts to be compared are all legal documents in a certain field, a small number of sample texts from the legal documents in that field can be used to fine-tune the pre-trained word vector model, and the preset word vector library is constructed using the trained model. When new legal documents appear in the field, a small number of samples from them can be used periodically to fine-tune the word vector model, and the trained model is then used to update the preset word vector library for the newly added documents, which can improve the stability and accuracy of the text comparison method.
It will be appreciated that the texts compared by the apparatus are not limited to legal documents, and the apparatus may also be used to compare other types of text, such as determining whether financial statements containing data conflict, whether textbooks of the same subject conflict, and so on. The present disclosure is not limited as to the type of text that is compared.
According to some embodiments, the first obtaining module 401 comprises: the first segmentation unit is configured to segment the target text to obtain at least one sentence included in the target text; a first obtaining unit configured to obtain, for each of at least one sentence included in the target text, at least one target word included in the sentence; the second obtaining unit is configured to obtain a word vector corresponding to each of at least one target word included in each statement from the word vector library; the first determining unit is configured to determine a corresponding target sentence vector of each sentence based on a corresponding word vector of at least one target word included in each sentence; and a second determining unit configured to determine a target vector corresponding to the target text based on a target sentence vector corresponding to each of at least one sentence included in the target text. Therefore, at least one target word in each sentence can be acquired by the first acquisition unit, the sentence vector of the target sentence is determined by the first determination unit based on the word vector library, and the target vector corresponding to the target text is determined by the second determination unit based on the target sentence vector for comparison of the similarity of subsequent texts.
For example, the first obtaining unit may perform word segmentation on each sentence to obtain a plurality of words, and the second obtaining unit may match each of the plurality of words against the plurality of candidate words in the word vector library. A word may be determined to be a target word in response to determining that the word vector library includes a candidate word matching that word. Further, the second obtaining unit may take the word vector of the candidate word matching the target word as the word vector of the target word. The first determining unit may accumulate and/or concatenate the word vectors of the at least one target word to obtain the target sentence vector corresponding to the sentence.
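The lookup-and-accumulate step described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the toy library `WORD_VECTORS`, the whitespace-based `segment_words` stand-in for a real word segmenter, and the function names are all assumptions, and only the "accumulate" variant (element-wise sum) is shown.

```python
from typing import Dict, List

# Hypothetical toy word vector library: candidate word -> word vector.
WORD_VECTORS: Dict[str, List[float]] = {
    "penalty": [0.9, 0.1, 0.0],
    "shall": [0.1, 0.8, 0.1],
    "exceed": [0.0, 0.2, 0.9],
}


def segment_words(sentence: str) -> List[str]:
    """Stand-in word segmenter (assumption): lowercase, drop periods, split on whitespace."""
    return sentence.lower().replace(".", " ").split()


def target_words(sentence: str, library: Dict[str, List[float]]) -> List[str]:
    """Keep only the words that match a candidate word in the library."""
    return [word for word in segment_words(sentence) if word in library]


def sentence_vector(sentence: str, library: Dict[str, List[float]]) -> List[float]:
    """Accumulate (element-wise sum) the word vectors of the target words."""
    dim = len(next(iter(library.values())))
    total = [0.0] * dim
    for word in target_words(sentence, library):
        for i, value in enumerate(library[word]):
            total[i] += value
    return total
```

Words with no matching candidate word, such as "not" or "500" in "The penalty shall not exceed 500.", are simply skipped, so a sentence with no target words maps to the zero vector in this sketch.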
According to some embodiments, the second obtaining module 402 comprises: a second segmentation unit configured to segment the text to be compared to obtain at least one sentence included in the text to be compared; a third obtaining unit configured to obtain, for each of the at least one sentence included in the text to be compared, at least one word to be compared included in the sentence; a fourth obtaining unit configured to obtain, from the word vector library, a word vector to be compared corresponding to each of the at least one word to be compared included in each sentence; a third determining unit configured to determine a sentence vector to be compared corresponding to each sentence based on the word vector corresponding to each of the at least one word to be compared included in that sentence; and a fourth determining unit configured to determine a vector to be compared corresponding to the text to be compared based on the sentence vector to be compared corresponding to each of the at least one sentence included in the text to be compared.
Thus, the fourth obtaining unit obtains from the word vector library the word vectors corresponding to the words to be compared, the third determining unit uses them to determine the sentence vectors to be compared, and the fourth determining unit then determines, based on those sentence vectors, the vector to be compared corresponding to each text to be compared, for use in the similarity comparison between the target text and the texts to be compared.
According to one embodiment, the first screening module 403 comprises: a calculating unit configured to calculate the cosine similarity between the target text and each text to be compared based on the target vector corresponding to the target text and the vector to be compared corresponding to that text to be compared; and a seventh determining unit configured to determine a plurality of preliminary screened texts based on the calculated cosine similarities. Thus, the similarity between two texts is given by the cosine similarity, calculated by the calculating unit, between their corresponding vectors, and the seventh determining unit selects as preliminary screened texts those texts to be compared whose similarity to the target text is greater than the first preset threshold.
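The cosine-similarity screening described above can be illustrated with a short sketch. The function names, the dictionary keyed by document name, and the choice to treat a zero vector as having similarity 0 are assumptions made for illustration; the threshold value itself is left to the caller, as the disclosure does not fix it.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def preliminary_screen(
    target_vector: Sequence[float],
    vectors_to_compare: Dict[str, Sequence[float]],
    first_threshold: float,
) -> List[str]:
    """Return the names of texts whose similarity to the target exceeds the threshold."""
    return [
        name
        for name, vector in vectors_to_compare.items()
        if cosine_similarity(target_vector, vector) > first_threshold
    ]
```

Because cosine similarity ignores vector length, a text whose vector is a scaled copy of the target vector passes the screen regardless of how long the text is.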
After the first screening module 403 has screened out a plurality of preliminary screened texts that may conflict with the target text, the third obtaining module 404 obtains the text features of the target text and the text features of each of the preliminary screened texts.
According to some embodiments, the third obtaining module is configured to determine the text feature of the target text based on the respective text features of the at least one sentence included in the target text, and to determine the text feature of each of the preliminary screened texts based on the respective text features of the at least one sentence included in that preliminary screened text. It can be understood that the text features of a sentence are easier to quantify, and more precise, than the text features of a whole text. Determining the text features of a text from the text features of its sentences therefore makes the features easier to quantify and describes them more accurately, which further improves the accuracy of the text comparison.
According to some embodiments, the text features of a sentence include at least one of: a subject word included in the sentence, the grammatical structure of the sentence, a numerical range and a numerical magnitude in the sentence, and the length of the sentence, wherein the subject word is defined in a preset dictionary. The sentence-level text features listed in the present disclosure perform well in identifying conflicting sentences; other sentence-level text features may also be set according to the type of text to be compared, which is not limited by the present disclosure.
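As a rough illustration of how such sentence-level features might be collected, the sketch below extracts subject words, numbers, a numerical range, and length from an English sentence. Everything here is an assumption for illustration: the `SUBJECT_DICTIONARY` contents, the `SentenceFeatures` structure, the regular expressions, and the use of character count as length. The grammatical-structure feature is omitted because it would require a parser.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

# Hypothetical preset dictionary of subject words.
SUBJECT_DICTIONARY: Set[str] = {"fine", "deadline"}


@dataclass
class SentenceFeatures:
    subject_words: List[str]
    numbers: List[float]
    number_range: Optional[Tuple[float, float]]
    length: int


def extract_features(sentence: str, dictionary: Set[str]) -> SentenceFeatures:
    """Collect simple, quantifiable features of one sentence."""
    words = re.findall(r"[A-Za-z]+", sentence.lower())
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", sentence)]
    return SentenceFeatures(
        subject_words=[w for w in words if w in dictionary],
        numbers=numbers,
        # Smallest and largest number mentioned, or None if there are none.
        number_range=(min(numbers), max(numbers)) if numbers else None,
        length=len(sentence),
    )
```

For a sentence such as "The fine is between 100 and 500 yuan.", this sketch records "fine" as the subject word and 100 to 500 as the numerical range, which are the kinds of values a later conflict check could compare.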
According to some embodiments, the second screening module 405 comprises: a fifth determining unit configured to determine, for each of the plurality of preliminary screened texts, the conflicting sentences that are included in the preliminary screened text and semantically conflict with the target text, based on the text features of the preliminary screened text and the text features of the target text; and a sixth determining unit configured to determine the preliminary screened text to be a conflicting text in response to determining that the number of corresponding conflicting sentences is greater than a second preset threshold. Thus, the sixth determining unit uses the number of conflicting sentences to screen out the conflicting texts that semantically conflict with the target text.
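The counting-and-threshold logic of the second screening can be sketched as below. The disclosure does not specify in this passage how a conflict between two sentences is detected, so the rule used here (two sentences conflict when they share a subject word but their numerical ranges do not overlap) is purely a hypothetical stand-in, as are the `Feature` representation and all function names.

```python
from typing import Dict, List, Optional, Set, Tuple

Range = Optional[Tuple[float, float]]
Feature = Tuple[Set[str], Range]  # (subject words, numerical range) of one sentence


def sentences_conflict(a: Feature, b: Feature) -> bool:
    """Hypothetical rule: same subject word, but non-overlapping numerical ranges."""
    subjects_a, range_a = a
    subjects_b, range_b = b
    if not (subjects_a & subjects_b) or range_a is None or range_b is None:
        return False
    return range_a[1] < range_b[0] or range_b[1] < range_a[0]


def count_conflicting_sentences(target: List[Feature], candidate: List[Feature]) -> int:
    """Number of candidate sentences that conflict with at least one target sentence."""
    return sum(
        1 for c in candidate if any(sentences_conflict(c, t) for t in target)
    )


def select_conflicting_texts(
    target: List[Feature],
    screened_texts: Dict[str, List[Feature]],
    second_threshold: int,
) -> List[str]:
    """Keep texts whose conflicting-sentence count exceeds the second threshold."""
    return [
        name
        for name, features in screened_texts.items()
        if count_conflicting_sentences(target, features) > second_threshold
    ]
```

Requiring the count to exceed a threshold, rather than flagging a text on its first conflicting sentence, lets the caller trade sensitivity against false positives from a single noisy sentence.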
According to an aspect of the disclosure, a computer device is provided that includes a memory, a processor, and a computer program stored on the memory. The processor is configured to execute the computer program to implement the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, a non-transitory computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of any of the method embodiments described above.
According to an aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of any of the method embodiments described above.
Illustrative examples of such computer devices, non-transitory computer-readable storage media, and computer program products are described below in connection with FIG. 5.
Fig. 5 illustrates an example configuration of a computer device 500 that may be used to implement the methods described herein.
Computer device 500 may be a variety of different types of devices, such as a server of a service provider, a device associated with a client (e.g., a client device), a system on a chip, and/or any other suitable computer device or computing system. Examples of computer device 500 include, but are not limited to: a desktop computer, a server computer, a notebook or netbook computer, a mobile device (e.g., a tablet, a cellular or other wireless telephone (e.g., a smartphone), a notepad computer, a mobile station), a wearable device (e.g., glasses, a watch), an entertainment device (e.g., an entertainment appliance, a set-top box communicatively coupled to a display device, a gaming console), a television or other display device, an automotive computer, and so forth. Thus, the computer device 500 may range from a full resource device with substantial memory and processor resources (e.g., personal computers, game consoles) to a low-resource device with limited memory and/or processing resources (e.g., traditional set-top boxes, hand-held game consoles).
The computer device 500 may include at least one processor 502, memory 504, communication interface(s) 506, display device 508, other input/output (I/O) devices 510, and one or more mass storage devices 512, which may be capable of communicating with each other, such as through a system bus 514 or other appropriate connection.
Processor 502 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 502 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 502 can be configured to retrieve and execute computer-readable instructions stored in the memory 504, mass storage device 512, or other computer-readable medium, such as program code for an operating system 516, program code for an application 518, program code for other programs 520, and so forth.
Memory 504 and mass storage device 512 are examples of computer-readable storage media for storing instructions that are executed by processor 502 to implement the various functions described above. By way of example, the memory 504 may generally include both volatile and nonvolatile memory (e.g., RAM, ROM, and the like). In addition, mass storage device 512 may generally include a hard disk drive, solid state drive, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CD, DVD), storage arrays, network attached storage, storage area networks, and the like. Memory 504 and mass storage device 512 may both be referred to herein collectively as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 502 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 512. These programs include an operating system 516, one or more application programs 518, other programs 520, and program data 522, and they may be loaded into memory 504 for execution. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the method 100, the first obtaining module 401, the second obtaining module 402, the first screening module 403, the third obtaining module 404, and the second screening module 405, and/or further embodiments described herein.
Although illustrated in fig. 5 as being stored in memory 504 of computer device 500, modules 516, 518, 520, and 522, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computer device 500. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism. Computer storage media, as defined herein, does not include communication media.
Computer device 500 may also include one or more communication interfaces 506 for exchanging data with other devices, such as over a network, a direct connection, and so forth, as previously discussed. Such communication interfaces may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless interface (such as an IEEE 802.11 wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth. The communication interface 506 may facilitate communication within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 506 may also provide for communication with external storage devices (not shown), such as in storage arrays, network attached storage, storage area networks, and the like.
In some examples, a display device 508, such as a monitor, may be included for displaying information and images to a user. Other I/O devices 510 may be devices that receive various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.
While the disclosure has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative and exemplary and not restrictive; the present disclosure is not limited to the disclosed embodiments. Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps than those listed and the words "a" or "an" do not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (17)

1. A text comparison method, comprising:
acquiring a target vector corresponding to a target text;
for each text to be compared in a plurality of texts to be compared, obtaining a vector to be compared corresponding to the text to be compared;
preliminarily screening the plurality of texts to be compared, based on the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared, to obtain a plurality of preliminary screened texts whose similarity to the target text is greater than a first preset threshold;
acquiring text features of the target text and text features of each of the plurality of preliminary screened texts; and
screening at least one conflicting text that semantically conflicts with the target text from the plurality of preliminary screened texts based on the corresponding text features.
2. The method of claim 1, further comprising:
constructing a preset word vector library, wherein the word vector library comprises a plurality of candidate words and a word vector corresponding to each candidate word,
and acquiring a target vector corresponding to the target text and a to-be-compared vector corresponding to each to-be-compared text based on the word vector library.
3. The method of claim 2, wherein obtaining a target vector corresponding to a target text comprises:
segmenting the target text to obtain at least one sentence included in the target text;
for each sentence in at least one sentence included in the target text, executing the following steps:
acquiring at least one target word included in the sentence;
obtaining a word vector corresponding to each target word from the word vector library;
determining a target sentence vector corresponding to the sentence based on the word vector corresponding to each of the at least one target word; and
determining a target vector corresponding to the target text based on the target sentence vector corresponding to each of the at least one sentence included in the target text.
4. The method of claim 2, wherein obtaining the vectors to be compared corresponding to the texts to be compared comprises:
segmenting the text to be compared to obtain at least one sentence included in the text to be compared;
for each sentence in at least one sentence included in the text to be compared, executing the following steps:
acquiring at least one word to be compared included in the sentence;
obtaining a word vector to be compared corresponding to each word to be compared from the word vector library;
determining a sentence vector to be compared corresponding to the sentence based on the word vector corresponding to each of the at least one word to be compared; and
determining the vector to be compared corresponding to the text to be compared based on the sentence vector to be compared corresponding to each of the at least one sentence included in the text to be compared.
5. The method according to claim 1, wherein the text feature of the target text is determined based on the respective text feature of at least one sentence included in the target text,
and the text feature of each of the preliminary screened texts is determined based on the respective text feature of at least one sentence included in that preliminary screened text.
6. The method of claim 5, wherein the textual features of a sentence comprise at least one of:
a subject word included in the sentence, the grammatical structure of the sentence, a numerical range and a numerical magnitude in the sentence, and the length of the sentence, wherein the subject word is defined in a preset dictionary.
7. The method of claim 5, wherein screening at least one conflicting text that semantically conflicts with the target text from the plurality of preliminary screened texts based on the corresponding text features comprises:
for each of the plurality of preliminary screened texts, determining the conflicting sentences that are included in the preliminary screened text and semantically conflict with the target text, based on the text features corresponding to at least one sentence included in the preliminary screened text and the text features corresponding to at least one sentence included in the target text; and
in response to determining that the number of corresponding conflicting sentences is greater than a second preset threshold, determining the preliminary screened text to be a conflicting text.
8. A text comparison apparatus, the apparatus comprising:
a first obtaining module configured to obtain a target vector corresponding to a target text;
a second obtaining module configured to obtain a vector to be compared corresponding to each text to be compared in a plurality of texts to be compared;
a first screening module configured to preliminarily screen the plurality of texts to be compared, based on the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared, to obtain a plurality of preliminary screened texts whose similarity to the target text is greater than a first preset threshold;
a third obtaining module configured to obtain a text feature of the target text and a text feature of each of the plurality of preliminary screened texts; and
a second screening module configured to screen at least one conflicting text that semantically conflicts with the target text from the plurality of preliminary screened texts based on the corresponding text features.
9. The apparatus of claim 8, further comprising:
a construction module configured to construct a preset word vector library, wherein the word vector library comprises a plurality of candidate words and a word vector corresponding to each candidate word, and the target vector corresponding to the target text and the vector to be compared corresponding to each text to be compared are obtained based on the word vector library.
10. The apparatus of claim 9, wherein the first obtaining module comprises:
a first segmentation unit configured to segment the target text to obtain at least one sentence included in the target text;
a first obtaining unit configured to obtain, for each of the at least one sentence included in the target text, at least one target word included in the sentence;
a second obtaining unit configured to obtain, from the word vector library, a word vector corresponding to each of the at least one target word included in each sentence;
a first determining unit configured to determine a target sentence vector corresponding to each sentence based on the word vector corresponding to each of the at least one target word included in that sentence; and
a second determining unit configured to determine a target vector corresponding to the target text based on the target sentence vector corresponding to each of the at least one sentence included in the target text.
11. The apparatus of claim 9, wherein the second obtaining module comprises:
a second segmentation unit configured to segment the text to be compared to obtain at least one sentence included in the text to be compared;
a third obtaining unit configured to obtain, for each of the at least one sentence included in the text to be compared, at least one word to be compared included in the sentence;
a fourth obtaining unit configured to obtain, from the word vector library, a word vector to be compared corresponding to each of the at least one word to be compared included in each sentence;
a third determining unit configured to determine a sentence vector to be compared corresponding to each sentence based on the word vector corresponding to each of the at least one word to be compared included in that sentence; and
a fourth determining unit configured to determine a vector to be compared corresponding to the text to be compared based on the sentence vector to be compared corresponding to each of the at least one sentence included in the text to be compared.
12. The apparatus according to claim 8, wherein the third obtaining module is configured to determine the text feature of the target text based on the text feature corresponding to each of the at least one sentence included in the target text, and to determine the text feature of each of the preliminary screened texts based on the text feature corresponding to each of the at least one sentence included in that preliminary screened text.
13. The apparatus of claim 12, wherein the textual features of a sentence comprise at least one of:
a subject word included in the sentence, the grammatical structure of the sentence, a numerical range and a numerical magnitude in the sentence, and the length of the sentence, wherein the subject word is defined in a preset dictionary.
14. The apparatus of claim 12, wherein the second screening module comprises:
a fifth determining unit, configured to determine, for each of the plurality of preliminary screened texts, a conflict sentence included in the preliminary screened text that is semantically conflicting with the target text based on a text feature corresponding to each of at least one sentence included in the preliminary screened text and a text feature corresponding to each of at least one sentence included in the target text; and
a sixth determining unit configured to determine the preliminary screened text as a conflicting text in response to determining that the number of corresponding conflicting sentences is greater than a second preset threshold.
15. A computer device, comprising:
a memory, a processor, and a computer program stored on the memory,
wherein the processor is configured to execute the computer program to implement the steps of the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the steps of the method of any of claims 1-7.
17. A computer program product comprising a computer program, wherein the computer program realizes the steps of the method of any one of claims 1-7 when executed by a processor.
CN202111131481.6A 2021-09-26 2021-09-26 Text comparison method, device, equipment and medium Pending CN113821601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111131481.6A CN113821601A (en) 2021-09-26 2021-09-26 Text comparison method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113821601A true CN113821601A (en) 2021-12-21

Family

ID=78915528

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111131481.6A Pending CN113821601A (en) 2021-09-26 2021-09-26 Text comparison method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113821601A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117829140A (en) * 2024-03-04 2024-04-05 证通股份有限公司 Automatic comparison method and system for regulations and regulations
CN117829140B (en) * 2024-03-04 2024-05-31 证通股份有限公司 Automatic comparison method and system for regulations and regulations

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377694A (en) * 2019-06-06 2019-10-25 北京百度网讯科技有限公司 Text is marked to the method, apparatus, equipment and computer storage medium of logical relation
CN111539213A (en) * 2020-04-17 2020-08-14 华侨大学 Intelligent detection method for semantic mutual exclusion of multi-source management terms
CN111666761A (en) * 2020-05-13 2020-09-15 北京大学 Fine-grained emotion analysis model training method and device
CN113435182A (en) * 2021-07-21 2021-09-24 唯品会(广州)软件有限公司 Method, device and equipment for detecting conflict of classification labels in natural language processing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
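王楷翔: "Method for identifying knowledge semantic conflicts based on entailment reasoning, and its implementation", China Excellent Master's Theses Full-text Database (electronic journal), pages 21 - 49 *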

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination