CN113987118A - Corpus acquisition method, apparatus, device and storage medium - Google Patents

Corpus acquisition method, apparatus, device and storage medium

Info

Publication number
CN113987118A
Authority
CN
China
Prior art keywords
corpus
word
information text
words
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111154517.2A
Other languages
Chinese (zh)
Inventor
张�浩
应森辞
董宇
霍承富
周凯
任卫军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202111154517.2A
Publication of CN113987118A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0631 Item recommendations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/04 Manufacturing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Human Resources & Organizations (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • Manufacturing & Machinery (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the disclosure relate to a corpus acquisition method, apparatus, device and storage medium. The method obtains a plurality of information texts of an object and processes each information text separately to obtain an intermediate corpus of each information text, where the intermediate corpus comprises at least part of the words in the information text. For each intermediate corpus, the words in the intermediate corpus are then combined, and a target corpus of the object is determined based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the resulting combined words in the plurality of information texts. This avoids the bias toward frequently occurring words that arises in the related art from relying on word frequency alone, makes it possible to discover corpus entries that are important but not high-frequency, improves the accuracy and comprehensiveness of corpus determination, and saves resources because no dictionary with a very large data volume needs to be constructed.

Description

Corpus acquisition method, apparatus, device and storage medium
Technical Field
The embodiments of the present disclosure relate to the technical field of text processing, and in particular to a corpus acquisition method, apparatus, device and storage medium.
Background
The Customer-to-Manufacturer (C2M) model is a business model that links customers directly to manufacturers. In the C2M model, a manufacturer can plan commodities that meet the requirements of consumers and the market according to the historical performance of commodities. When the historical performance of a commodity is analyzed, entity discovery technology is generally used. How to obtain, from information texts, a corpus that is sufficiently accurate and comprehensive and that matches the characteristics of the commodity is a technical problem in entity discovery that those skilled in the art have long sought to solve.
Disclosure of Invention
To solve the above technical problem, or at least partially solve it, embodiments of the present disclosure provide a corpus acquisition method, apparatus, device and storage medium.
A first aspect of the embodiments of the present disclosure provides a corpus acquiring method, including:
acquiring a plurality of information texts of an object;
processing each information text respectively to obtain an intermediate corpus of each information text, wherein the intermediate corpus comprises at least part of words in the information text;
and, for each intermediate corpus, combining the words in the intermediate corpus, and determining the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value, in the plurality of information texts, of the combined words obtained by the combination.
A second aspect of the embodiments of the present disclosure provides a corpus acquisition apparatus, including:
an acquisition module, configured to acquire a plurality of information texts of an object;
a processing module, configured to process each information text respectively to obtain an intermediate corpus of each information text, wherein the intermediate corpus comprises at least part of the words in the information text;
and a determining module, configured to combine, for each intermediate corpus, the words in the intermediate corpus, and to determine the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the combined words in the plurality of information texts.
A third aspect of embodiments of the present disclosure provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and when the computer program is executed by the processor, the method of the first aspect may be implemented.
A fourth aspect of embodiments of the present disclosure provides a computer-readable storage medium having a computer program stored therein, which, when executed by a processor, may implement the method of the first aspect described above.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
According to the embodiments of the present disclosure, a plurality of information texts of an object are obtained, and each information text is processed separately to obtain an intermediate corpus of each information text, the intermediate corpus comprising at least part of the words in the information text. For each intermediate corpus, the words in the intermediate corpus are then combined, and the target corpus of the object is determined based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the resulting combined words in the plurality of information texts. Because an intermediate corpus is first obtained from the information texts of the object and the final corpus is then determined from the intermediate corpus based on both the word frequency and the TFIDF value, the bias toward frequently occurring words caused by relying on word frequency alone in the related art is avoided, important corpus entries with low word frequency can be discovered, the accuracy and comprehensiveness of corpus determination are improved, and resources are saved because no dictionary with a large data volume needs to be constructed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a corpus acquisition scenario according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of a corpus acquisition method according to an embodiment of the present disclosure;
Fig. 3 is a flowchart of another corpus acquisition method according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a corpus acquisition apparatus according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of a computer device in an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
When the historical performance of a commodity is analyzed at the granularity of a Standardized Product Unit (SPU), entity discovery is a key technology. How to obtain, from information texts, a corpus that is sufficiently accurate and comprehensive and that matches the characteristics of the commodity is a technical problem in entity discovery that those skilled in the art have long sought to solve.
In the related art, the single characters in a text are usually obtained first; every two adjacent characters are then combined into a word, then every three, and, if the preset word length is 4, every four. The words obtained in this way form a dictionary, from which the corpus is obtained through screening. For example, the process of determining the corpus in the related art may include: setting two parameters, namely a maximum word length and a minimum word frequency, for text that has not been segmented, and segmenting all of the text under these two parameters; constructing a very large top dictionary on this basis; counting the word frequency and the occurrence probability of each word in the top dictionary; calculating the probabilities of all word combinations of each sentence in the text from the occurrence probabilities of the words; selecting the combination with the maximum probability for each sentence using the Expectation-Maximization (EM) algorithm; iteratively updating the occurrence probabilities of the words in the dictionary; and, after the whole corpus has been updated, deleting the words with lower probability to simplify the dictionary. However, the related art needs to construct a very large dictionary, which consumes resources. In addition, the initial probability of each word in the EM algorithm is its word frequency divided by the total number of words in the text, so the algorithm tends to select words that occur many times rather than distinctive new words, and the accuracy and comprehensiveness of the corpus cannot meet the requirements.
To address the technical problem in the related art of how to obtain, from information texts, a corpus that is sufficiently accurate and comprehensive and that matches the characteristics of a commodity, an embodiment of the present disclosure provides a corpus acquisition scheme. Fig. 1 is an exemplary schematic diagram of a corpus acquisition scenario according to an embodiment of the present disclosure. As shown in Fig. 1, a computer device may obtain a plurality of information texts of an object and then process each information text to obtain an intermediate corpus of each information text, where the intermediate corpus may include at least part of the words in the information text. The computer device may then, for each intermediate corpus, combine the words in the intermediate corpus and determine the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the combined words in the plurality of information texts. Because an intermediate corpus is first obtained from the information texts of the object and the final corpus is then determined from the intermediate corpus based on both the word frequency and the TFIDF value, the bias toward frequently occurring words caused by relying on word frequency alone in the related art is avoided, important corpus entries with low word frequency can be discovered, the accuracy and comprehensiveness of corpus determination are improved, and resources are saved because no dictionary with a large data volume needs to be constructed.
In order to better understand the inventive concept of the embodiments of the present disclosure, the following describes technical solutions of the embodiments of the present disclosure with reference to exemplary embodiments.
Fig. 2 is a flowchart of a corpus obtaining method provided by an embodiment of the present disclosure, which may be executed by a computer device. As shown in fig. 2, the method provided by this embodiment includes the following steps:
Step 101: obtain a plurality of information texts of an object.
The object may be understood as an entity whose historical performance needs to be analyzed; for example, the object may be a commodity in an e-commerce scenario. The number of objects is not limited and may be one or more. An information text may be unprocessed original text related to the object, and one object may correspond to a plurality of information texts.
In the embodiment of the present disclosure, the computer device may obtain a plurality of corresponding information texts for an object that needs to be analyzed. The source of the information texts is not limited: for example, they may be input by a user, sent by other computer devices, or obtained from the Internet.
Step 102: process each information text respectively to obtain an intermediate corpus of each information text.
The intermediate corpus may be a corpus obtained by preprocessing the information text. The preprocessing may be understood as a transformation (transform) that produces a structure that is easy to use in subsequent algorithmic processing, that is, the intermediate corpus.
In the embodiment of the present disclosure, the computer device may perform word segmentation on each information text to obtain a first word segmentation result corresponding to each information text, and may then generate the intermediate corpus of each information text based on that first word segmentation result, where the first word segmentation result includes the single characters and/or words obtained by splitting the information text.
The first word segmentation result can be understood as the text broken up into individual units. Because Chinese and English differ, Chinese text is split into single characters, whereas English text is kept as whole words. Specifically, the computer device may segment each information text using a word segmentation algorithm to obtain the corresponding first word segmentation result. The word segmentation algorithm may be chosen according to the actual situation; for example, a SubWord algorithm may be used, which divides the text into minimum units that can later be combined into new words. Illustratively, "open braided wire" is split into "open", "braided" and "wire". The computer device may then further preprocess the first word segmentation result corresponding to each information text to obtain the intermediate corpus of each information text.
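The splitting rule described above (single characters for Chinese, whole words for English) can be sketched as follows. This is only an illustrative sketch, not the segmentation algorithm of the disclosure: the function name `first_segmentation` is hypothetical, and keeping runs of digits together and emitting punctuation as separate units are assumptions made here for the example.

```python
import re

# Assumed splitting rule for this sketch:
#   - each CJK ideograph becomes its own unit,
#   - a run of ASCII letters stays together as one word,
#   - a run of ASCII digits stays together as one unit,
#   - any other non-space character (e.g. punctuation) is its own unit.
_UNIT = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|[0-9]+|\S")


def first_segmentation(text):
    """Split an information text into minimum units (hypothetical helper)."""
    return _UNIT.findall(text)


print(first_segmentation("open braided wire"))
print(first_segmentation("编织线 USB cable, 2m"))
```

Whitespace never matches any alternative of the pattern, so it only acts as a separator; punctuation survives as units and is left for the later preprocessing step to delete.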
The further preprocessing of the first word segmentation result corresponding to each information text may consist of deleting some words according to preset rules, so that the intermediate corpus in the embodiment of the present disclosure includes at least part of the words in the information text. The preset rules may include deleting invalid characters, replacing numbers, and so on; these are only examples, and the rules may be set according to the actual situation.
In this scheme, the computer device preprocesses each information text and converts the original text into an intermediate corpus that is easy to process further, which facilitates the subsequent generation of the target corpus.
Step 103: for each intermediate corpus, combine the words in the intermediate corpus, and determine the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the combined words in the plurality of information texts.
A combined word may be a word obtained by combining two words. Here, "word" is a general term for any unit that can be combined, and may be one of the above-mentioned single characters, a word, or a combined word. The word frequency may be the number of times a word appears in a text. The Term Frequency-Inverse Document Frequency (TFIDF) value may be used to assess how important a word is to a text; it is proportional to the word frequency and inversely proportional to the number of texts in which the word appears. The target corpus may be the finally generated corpus that matches the characteristics of the object.
Exemplarily, Fig. 3 is a flowchart of another corpus acquisition method provided by an embodiment of the present disclosure. As shown in Fig. 3, in a possible implementation, combining the words in each intermediate corpus and determining the target corpus of the object based on the word frequency and the TFIDF value of the combined words in the plurality of information texts may include the following steps:
Step 301: combine every two adjacent words in the intermediate corpus to obtain a plurality of combined words.
The combination may be a process of constructing one word from two units; for example, it may be implemented using bigrams.
Specifically, after the computer device obtains the intermediate corpus, because the initial intermediate corpus consists of single characters and/or words, every two adjacent words in it can be combined to obtain a plurality of combined words; that is, a plurality of combined words is obtained for each intermediate corpus.
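As a minimal sketch of the bigram combination in step 301, assuming that an intermediate corpus is represented as a Python list of units (the function name `adjacent_pairs` is hypothetical):

```python
def adjacent_pairs(units):
    """Return every pair of adjacent units as a candidate combined word."""
    # zip pairs each unit with its right-hand neighbour; the last unit has
    # no neighbour, so a list of n units yields n - 1 pairs.
    return list(zip(units, units[1:]))


print(adjacent_pairs(["open", "braided", "wire"]))
```

Each pair is kept as a tuple rather than a joined string so that the two source words can still be identified when they are later deleted from the intermediate corpus.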
Step 302: for each combined word, multiply the word frequency of the combined word by the TFIDF value of the combined word to obtain a first operation value.
After obtaining the plurality of combined words, the computer device may determine, for each combined word, the number of times the combined word occurs in the corresponding information text, that is, the word frequency of the combined word, and the number of target information texts, among the plurality of information texts, in which the combined word occurs. Because the number of sentences in the information texts is huge and a serial program would take too long, the embodiment of the present disclosure may adopt the MapReduce programming model: for each information text, the sentences are randomly distributed to different mapper programs, which calculate the word frequencies of the combined words in parallel, and the word frequencies of the same combined word are finally accumulated in a reducer program to obtain the total word frequency of the combined word in that information text.
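The map and reduce roles described above can be illustrated with a small in-process sketch. It is not a distributed implementation: the mappers here run one after another in a single process, sentences are assigned round-robin rather than randomly so that the result is reproducible, and the names `mapper`, `reducer` and `pair_frequencies` are hypothetical.

```python
from collections import Counter
from functools import reduce


def mapper(sentences):
    """Count adjacent-pair frequencies within one shard of sentences."""
    counts = Counter()
    for units in sentences:
        counts.update(zip(units, units[1:]))
    return counts


def reducer(left, right):
    """Accumulate the partial counts of the same combined word."""
    return left + right


def pair_frequencies(sentences, n_mappers=2):
    """Total frequency of each combined word in one information text."""
    # Round-robin sharding is an assumption of this sketch.
    shards = [sentences[i::n_mappers] for i in range(n_mappers)]
    return reduce(reducer, map(mapper, shards), Counter())


text = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
print(pair_frequencies(text))
```

Because counting is additive, the totals do not depend on how the sentences are distributed among the mappers, which is what makes the random distribution in the description safe.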
The computer device may then divide the word frequency of each combined word by the number of target information texts containing that combined word, and take the quotient as the TFIDF value of the combined word. For example, assuming that the combined word "activation" occurs 5 times in the corresponding information text and that it occurs in 7 information texts, its TFIDF value is 5/7. The computer device may then multiply, for each combined word, the word frequency of the combined word by its TFIDF value to obtain a first operation value, so that one first operation value is calculated for each combined word.
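The arithmetic of this step can be written out as a short sketch. It follows the quotient stated in this description (word frequency divided by the number of texts containing the combined word), which differs from the logarithmic TF-IDF formula commonly found in textbooks; the function names are hypothetical.

```python
def tfidf_value(word_freq, n_texts_with_word):
    """TFIDF value as defined in this description: a plain quotient."""
    return word_freq / n_texts_with_word


def first_operation_value(word_freq, n_texts_with_word):
    """Word frequency multiplied by the TFIDF value of the combined word."""
    return word_freq * tfidf_value(word_freq, n_texts_with_word)


# Example from the description: 5 occurrences, present in 7 information texts.
print(tfidf_value(5, 7))            # 5/7
print(first_operation_value(5, 7))  # 5 * 5/7 = 25/7
```

Under this definition the first operation value grows with the square of the word frequency and shrinks as the combined word spreads over more texts.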
Step 303: add the target combined word, that is, the combined word with the largest first operation value, to the intermediate corpus, and delete from the intermediate corpus the words that were combined to form the target combined word.
Specifically, after obtaining the first operation value of each combined word, the computer device may determine the combined word with the largest first operation value as the target combined word, add the target combined word to the intermediate corpus, and correspondingly delete from the intermediate corpus the two words that were combined into the target combined word.
In this step, the computer device completes two tasks: new word discovery and template generation. In the new word discovery process, a Byte Pair Encoding (BPE) algorithm may be used to rank all combined words by their first operation values, and a greedy algorithm is then used to select the combined word with the highest first operation value as the new word. In the template generation process, a WordPiece (WP) algorithm may be used to traverse the intermediate corpus after the new word has been found, combine the two words with the largest first operation value into the target combined word, and add the new word generated in each iteration of the algorithm to the intermediate corpus.
The BPE algorithm tends to identify the entities that occur most frequently in the corpus, such as the size entities "numbers" or "sizes" that often occur in the T-shirt category, whereas the WP algorithm tends to merge the two words in the corpus that are most likely to form a new word, that is, two words that are often combined together, such as "silicone" and "glue" into "silicone", and "braiding" and "weaving" into "braiding".
Step 304: return to the step of combining every two adjacent words in the intermediate corpus to obtain a plurality of combined words, and repeat until the word frequencies of all combined words obtained from the intermediate corpus are smaller than a preset word frequency; then determine the words in the intermediate corpus as the target corpus of the object.
Specifically, after adding the target combined word to the intermediate corpus and deleting the words that were combined to form it, the computer device may return to step 301 with the intermediate corpus that now includes the target combined word, and determine whether the word frequencies of the newly generated combined words are all smaller than the preset word frequency. If so, the computer device stops and determines the words in the current intermediate corpus as the target corpus. If the word frequency of any generated combined word is greater than or equal to the preset word frequency, steps 302 and 303 are executed again, and the iteration continues until, in some iteration, the word frequencies of all combined words obtained from the intermediate corpus after step 301 are smaller than the preset word frequency.
In this step, the computer device uses the condition that the word frequencies of all combined words obtained from the intermediate corpus are smaller than the preset word frequency as the stopping constraint. The target combined word generated in the previous round enters the next round as a single unit and can be combined with its adjacent words; this process is repeated until the constraint is satisfied, at which point the iteration stops and the final target corpus of the object is determined.
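Steps 301 to 304 can be put together in a compact sketch. Several simplifications here are assumptions rather than part of the disclosure: each information text is a single list of units, the word frequency of a pair is counted across all texts, a merged word is formed by direct concatenation (which suits single Chinese characters), all occurrences of the chosen pair are merged in one pass, and ties are broken alphabetically. The names `merge_pair` and `discover_corpus` are hypothetical.

```python
from collections import Counter


def merge_pair(units, pair):
    """Replace each adjacent occurrence of `pair` with the joined word."""
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            out.append(units[i] + units[i + 1])
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out


def discover_corpus(texts, min_freq=2):
    """Greedy merge loop; returns the sorted words left in the corpora."""
    texts = [list(units) for units in texts]
    while True:
        tf, df = Counter(), Counter()
        for units in texts:
            pairs = list(zip(units, units[1:]))   # step 301
            tf.update(pairs)                      # occurrences of each pair
            df.update(set(pairs))                 # texts containing each pair
        # Step 304 stop condition: every pair is rarer than the threshold.
        if not tf or max(tf.values()) < min_freq:
            break
        # Step 302: first operation value = tf * (tf / df).
        best = max(sorted(tf), key=lambda p: tf[p] * (tf[p] / df[p]))
        # Step 303: the merged word replaces the two words that formed it.
        texts = [merge_pair(units, best) for units in texts]
    return sorted({unit for units in texts for unit in units})


print(discover_corpus([["a", "b", "c"], ["a", "b", "d"], ["a", "b"]]))
```

Every merge shortens at least one list of units, so the loop always terminates. In the example, the pair ("a", "b") occurs three times and is merged first; the remaining pairs each occur once, which is below `min_freq`, so the iteration stops.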
The corpus acquisition scheme may be implemented based on a Word Probability Encoding (WPE) algorithm. The WPE algorithm is a new word discovery algorithm based on the idea of the SubWord algorithm: it finds the new words in an information text that satisfy the conditions by means of word frequency statistics and extracts them as the corpus. Illustratively, for the information text "p 30mate20", the WPE algorithm would extract "mate" but would not extract "30m".
The scheme is a top-down processing scheme: the text is first segmented at the finest granularity, adjacent word blocks are then combined, and the corpus is determined based on a greedy algorithm and the TFIDF value. No very large dictionary needs to be constructed, which saves memory and avoids resource consumption. Whereas the prior art tends to select frequently occurring words rather than distinctive new words, the greedy algorithm in this scheme makes the segmentation of a sentence more reasonable, and the TFIDF value makes it possible to discover new words that are important but infrequent, which improves the accuracy and comprehensiveness of corpus determination.
According to the embodiments of the present disclosure, a plurality of information texts of an object are obtained, and each information text is processed separately to obtain an intermediate corpus of each information text, the intermediate corpus comprising at least part of the words in the information text. For each intermediate corpus, the words in the intermediate corpus are then combined, and the target corpus of the object is determined based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the resulting combined words in the plurality of information texts. Because an intermediate corpus is first obtained from the information texts of the object and the final corpus is then determined from the intermediate corpus based on both the word frequency and the TFIDF value, the bias toward frequently occurring words caused by relying on word frequency alone in the related art is avoided, important corpus entries with low word frequency can be discovered, the accuracy and comprehensiveness of corpus determination are improved, and resources are saved because no dictionary with a large data volume needs to be constructed.
In a possible implementation, generating the intermediate corpus of each information text based on the first word segmentation result corresponding to each information text may include: for the first word segmentation result of each information text, deleting first preset characters from the first word segmentation result to obtain a second word segmentation result of the information text; and generating the intermediate corpus of the information text based on the second word segmentation result of the information text, wherein the first preset characters include punctuation marks and/or spaces.
The first preset characters may be understood as invalid special characters; for example, they may include various punctuation marks, spaces, and the like. Specifically, when generating the intermediate corpus of each information text based on the corresponding first word segmentation result, the computer device may delete the first preset characters from each first word segmentation result to obtain a second word segmentation result, and then generate the intermediate corpus of the information text based on the second word segmentation result.
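A minimal sketch of this deletion step, assuming the first word segmentation result is a list of string units. Using Unicode general categories to recognize punctuation is a choice made for this example, and the function name `drop_first_preset_chars` is hypothetical.

```python
import unicodedata


def _is_invalid(unit):
    """True if the unit consists only of whitespace and/or punctuation."""
    # Unicode general categories starting with "P" are punctuation classes,
    # which covers both ASCII marks and full-width Chinese marks.
    return all(
        ch.isspace() or unicodedata.category(ch).startswith("P")
        for ch in unit
    )


def drop_first_preset_chars(units):
    """Delete punctuation and space units to get the second result."""
    return [unit for unit in units if not _is_invalid(unit)]


print(drop_first_preset_chars(["open", ",", " ", "wire", "。"]))
```

Symbols such as "$" belong to a symbol category rather than a punctuation category, so a placeholder of that kind would not be removed by this filter.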
In a possible implementation, when the intermediate corpus of the information text is generated based on the second word segmentation result of the information text, words with numeric meaning in the second word segmentation result may be replaced with a second preset character to obtain a third word segmentation result, and the third word segmentation result is taken as the intermediate corpus.
The above words with numeric meaning are to be understood as numbers expressed in various languages, such as Chinese numerals, Arabic numerals, and the like. The second preset character may be a uniform special marker symbol and may be set according to the actual situation; for example, the second preset character may be "$".
Specifically, when generating the intermediate corpus of the information text based on the second word segmentation result, the computer device may replace words with numeric meaning in the second word segmentation result with the uniform second preset character; a Chinese numeral may first be converted into an Arabic numeral by a numeric conversion unit and then replaced with the second preset character.
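A minimal sketch of this replacement step is given below. The disclosure does not describe how the numeric conversion unit works, so `chinese_to_arabic` here is a hypothetical, deliberately small converter (digits plus the units 十, 百 and 千 only), and `replace_numeric_words` is likewise an assumed name.

```python
import re
from typing import List, Optional

# Assumed lookup tables for a small subset of Chinese numerals.
_DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
           "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
_UNITS = {"十": 10, "百": 100, "千": 1000}
_ARABIC = re.compile(r"\d+(?:\.\d+)?")


def chinese_to_arabic(word: str) -> Optional[int]:
    """Convert a simple Chinese numeral to an int, or return None if the
    word contains any character that is not a supported numeral."""
    if not word:
        return None
    total, current = 0, 0
    for ch in word:
        if ch in _DIGITS:
            # Consecutive digits are read positionally, e.g. 二零二一 -> 2021.
            current = current * 10 + _DIGITS[ch]
        elif ch in _UNITS:
            # A bare unit such as 十 counts as 1 * unit.
            total += (current or 1) * _UNITS[ch]
            current = 0
        else:
            return None
    return total + current


def replace_numeric_words(second_result: List[str], placeholder: str = "$") -> List[str]:
    """Build the third word segmentation result: numeric words become the placeholder."""
    return [
        placeholder
        if _ARABIC.fullmatch(tok) or chinese_to_arabic(tok) is not None
        else tok
        for tok in second_result
    ]
```

Under these assumptions, a token list such as `["容量", "500", "ml", "三百二十", "克"]` becomes `["容量", "$", "ml", "$", "克"]`, so that texts differing only in their numbers share the same pattern.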
In this scheme, the handling of special characters and of words with numeric meaning is incorporated into the generation of the intermediate corpus, which makes it easier to find more uniform patterns in the subsequent corpus determination and improves the richness and comprehensiveness of the corpus, so that what is subsequently extracted as an attribute entity is not merely a word but also a template.
The embodiments of the present disclosure generate accurate and comprehensive corpora that match the characteristics of commodities. They can be applied to extracting commodity attribute entities from mixed multilingual text in the e-commerce field, and the extracted attributes include the specific characteristics of the corresponding commodities, so the attribute entities are rich. Furthermore, based on the generated entity templates and user information, service functions such as commodity planning, recommendation strategy formulation, and search recommendation can be implemented, and better service results can be obtained in those service scenarios.
Fig. 4 is a schematic structural diagram of a corpus acquisition apparatus according to an embodiment of the present disclosure, where the apparatus may be understood as the above computer device or as some of the functional modules in the computer device. As shown in fig. 4, the corpus acquisition apparatus 40 includes:
an obtaining module 41, configured to obtain a plurality of information texts of an object;
the processing module 42 is configured to process each information text to obtain an intermediate corpus of each information text, where the intermediate corpus includes at least some words in the information text;
a determining module 43, configured to combine, for each intermediate corpus, the words in the intermediate corpus, and determine a target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value, in the plurality of information texts, of the combined words obtained by the combination.
In one embodiment, the processing module 42 includes:
a processing submodule, configured to perform word segmentation on each information text separately to obtain a first word segmentation result corresponding to each information text;
a generating submodule, configured to generate an intermediate corpus of each information text based on the first word segmentation result corresponding to that information text;
wherein the first word segmentation result includes single characters and/or words obtained by splitting the information text.
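Splitting an information text into single characters and/or words could be sketched as follows. The disclosure does not name a tokenizer, so the regular expression and the function name `segment` are assumptions: runs of Latin letters and runs of digits are kept as words, and every other character (including each Chinese character, punctuation mark, and space) becomes its own token.

```python
import re
from typing import List

# Assumed rule: Latin-letter runs and digit runs are words; anything else
# is emitted one character at a time.
_TOKEN = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|.", re.DOTALL)


def segment(information_text: str) -> List[str]:
    """Return a first word segmentation result for one information text."""
    return _TOKEN.findall(information_text)


first_result = segment("纯棉T恤 500ml!")
```

Punctuation and spaces are kept at this stage so that a later step can delete them as first preset characters.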
In one embodiment, the generation submodule is configured to:
for the first word segmentation result of each information text, deleting a first preset character from the first word segmentation result to obtain a second word segmentation result of the information text;
generating an intermediate corpus of the information text based on the second word segmentation result of the information text;
wherein the first preset character includes punctuation marks and/or spaces.
In one embodiment, the generating sub-module is further configured to:
replacing words with numeric meaning in the second word segmentation result with a second preset character to obtain a third word segmentation result;
and taking the third word segmentation result as the intermediate corpus.
In one embodiment, the determining module 43 is configured to:
combining each two adjacent words in the intermediate corpus to obtain a plurality of combined words;
for each combined word, multiplying the word frequency of the combined word by the TFIDF value of the combined word to obtain a first operation value;
adding the target combined word with the largest first operation value to the intermediate corpus, and deleting from the intermediate corpus the words that were combined to obtain the target combined word;
and returning to the step of combining each two adjacent words in the intermediate corpus to obtain a plurality of combined words, and continuing execution until the word frequency of the combined words obtained from the intermediate corpus is smaller than a preset word frequency, and then determining the words in the intermediate corpus as the target corpus of the object.
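The iterative combination performed by the determining module can be sketched as below. Several details are assumptions, because this passage does not fix them: the word frequency is taken as the total number of occurrences of an adjacent pair across all intermediate corpora, the TFIDF value uses a smoothed inverse document frequency `log((1 + N) / (1 + df)) + 1`, the loop stops when no pair reaches the preset word frequency `min_freq`, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def pair_stats(corpora: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count adjacent word pairs: total occurrences (tf) and documents containing them (df)."""
    tf, df = Counter(), Counter()
    for words in corpora:
        pairs = list(zip(words, words[1:]))
        tf.update(pairs)
        df.update(set(pairs))
    return tf, df


def merge_pair(words: List[str], pair: Pair) -> List[str]:
    """Replace each occurrence of the adjacent pair with the single combined word."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) == pair:
            merged.append(words[i] + words[i + 1])
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


def acquire_corpus(intermediate: List[List[str]], min_freq: int = 2) -> Set[str]:
    """Greedily merge the best-scoring adjacent pair until none is frequent enough."""
    corpora = [list(words) for words in intermediate]
    n_docs = len(corpora)
    while True:
        tf, df = pair_stats(corpora)
        candidates = [p for p, count in tf.items() if count >= min_freq]
        if not candidates:
            break

        def first_operation_value(p: Pair) -> float:
            idf = math.log((1 + n_docs) / (1 + df[p])) + 1  # assumed smoothing
            return tf[p] * (tf[p] * idf)  # word frequency x TFIDF value

        best = max(candidates, key=first_operation_value)
        corpora = [merge_pair(words, best) for words in corpora]
    return {word for words in corpora for word in words}
```

For example, with the three intermediate corpora `["纯", "棉", "T", "恤"]`, `["纯", "棉", "衬", "衫"]` and `["纯", "棉", "T", "恤"]` and `min_freq=2`, this sketch returns the set containing "纯棉T恤", "纯棉", "衬" and "衫".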
The apparatus provided in this embodiment can execute the method in any one of the embodiments in fig. 1 to fig. 3, and the execution manner and the beneficial effects are similar, and are not described herein again.
The embodiment of the present disclosure further provides a computer device, which includes a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, the method of any one of the above-mentioned fig. 1 to fig. 3 may be implemented.
For example, fig. 5 is a schematic structural diagram of a computer device in an embodiment of the present disclosure. Referring now specifically to FIG. 5, a block diagram of a computer device 500 suitable for use in implementing embodiments of the present disclosure is shown. The computer device 500 in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), etc., and a stationary terminal such as a digital TV, a desktop computer, etc. The computer device shown in fig. 5 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 5, computer device 500 may include a processing means (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM503, various programs and data necessary for the operation of the computer apparatus 500 are also stored. The processing device 501, the ROM 502, and the RAM503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the computer device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates a computer device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the computer device; or may exist separately and not be incorporated into the computer device.
The computer readable medium carries one or more programs which, when executed by the computer device, cause the computer device to: acquire a plurality of information texts of an object; process each information text separately to obtain an intermediate corpus of each information text, wherein the intermediate corpus includes at least some of the words in the information text; and, for each intermediate corpus, combine the words in the intermediate corpus and determine the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value, in the plurality of information texts, of the combined words obtained by the combination.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The embodiments of the present disclosure further provide a computer-readable storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, the method of any one of the embodiments in fig. 1 to fig. 2 may be implemented, where the execution manner and the beneficial effects are similar, and are not described herein again.
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A corpus acquiring method is characterized by comprising the following steps:
acquiring a plurality of information texts of an object;
processing each information text respectively to obtain an intermediate corpus of each information text, wherein the intermediate corpus comprises at least part of words in the information text;
and, for each intermediate corpus, combining the words in the intermediate corpus, and determining the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value, in the plurality of information texts, of the combined words obtained by the combination.
2. The method according to claim 1, wherein said processing each information text separately to obtain the intermediate corpus of each information text comprises:
performing word segmentation processing on each information text respectively to obtain a first word segmentation result corresponding to each information text;
generating an intermediate corpus of each information text based on the first word segmentation result corresponding to that information text;
and the first word segmentation result comprises a single character and/or word obtained by splitting the information text.
3. The method according to claim 2, wherein the generating the intermediate corpus of each information text based on the first segmentation processing result corresponding to each information text comprises:
for the first word segmentation result of each information text, deleting a first preset character from the first word segmentation result to obtain a second word segmentation result of the information text;
generating an intermediate corpus of the information text based on a second word segmentation processing result of the information text;
wherein the first preset characters comprise punctuation marks and/or spaces.
4. The method according to claim 3, wherein the generating the intermediate corpus of the information text based on the second segmentation processing result of the information text comprises:
replacing words with numeric meaning in the second word segmentation result with a second preset character to obtain a third word segmentation result;
and taking the third word segmentation result as the intermediate corpus.
5. The method according to any one of claims 1 to 4, wherein the combining, for each intermediate corpus, the words in the intermediate corpus and determining the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the combined words in the plurality of information texts comprises:
combining each two adjacent words in the intermediate corpus to obtain a plurality of combined words;
for each combined word, multiplying the word frequency of the combined word by the TFIDF value of the combined word to obtain a first operation value;
adding the target combined word with the largest first operation value to the intermediate corpus, and deleting from the intermediate corpus the words that were combined to obtain the target combined word;
and returning to the step of combining each two adjacent words in the intermediate corpus to obtain a plurality of combined words, and continuing execution until the word frequency of the combined words obtained from the intermediate corpus is smaller than a preset word frequency, and then determining the words in the intermediate corpus as the target corpus of the object.
6. A corpus acquisition apparatus, comprising:
an acquisition module, configured to acquire a plurality of information texts of an object;
a processing module, configured to process each information text separately to obtain an intermediate corpus of each information text, wherein the intermediate corpus comprises at least some of the words in the information text;
and a determining module, configured to combine, for each intermediate corpus, the words in the intermediate corpus, and determine the target corpus of the object based on the word frequency and the term frequency-inverse document frequency (TFIDF) value of the combined words in the plurality of information texts.
7. The apparatus of claim 6, wherein the processing module comprises:
the processing submodule is used for performing word segmentation processing on each information text respectively to obtain a first word segmentation result corresponding to each information text;
a generating submodule, configured to generate an intermediate corpus of each information text based on the first word segmentation result corresponding to that information text;
and the first word segmentation result comprises a single character and/or word obtained by splitting the information text.
8. The apparatus of claim 7, wherein the generation submodule is configured to:
for the first word segmentation result of each information text, deleting a first preset character from the first word segmentation result to obtain a second word segmentation result of the information text;
generating an intermediate corpus of the information text based on a second word segmentation processing result of the information text;
wherein the first preset characters comprise punctuation marks and/or spaces.
9. A computer device, comprising:
memory and a processor, wherein the memory has stored therein a computer program which, when executed by the processor, implements the method of any of claims 1-5.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-5.
CN202111154517.2A 2021-09-29 2021-09-29 Corpus acquisition method, apparatus, device and storage medium Pending CN113987118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111154517.2A CN113987118A (en) 2021-09-29 2021-09-29 Corpus acquisition method, apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111154517.2A CN113987118A (en) 2021-09-29 2021-09-29 Corpus acquisition method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN113987118A true CN113987118A (en) 2022-01-28

Family

ID=79737291

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111154517.2A Pending CN113987118A (en) 2021-09-29 2021-09-29 Corpus acquisition method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN113987118A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117910467A (en) * 2024-03-15 2024-04-19 成都启英泰伦科技有限公司 Word segmentation processing method in offline voice recognition process
CN117910467B (en) * 2024-03-15 2024-05-10 成都启英泰伦科技有限公司 Word segmentation processing method in offline voice recognition process

Similar Documents

Publication Publication Date Title
CN114861889B (en) Deep learning model training method, target object detection method and device
CN115221191A (en) Virtual column construction method based on data lake and data query method
CN110852057A (en) Method and device for calculating text similarity
CN113987118A (en) Corpus acquisition method, apparatus, device and storage medium
CN114036921A (en) Policy information matching method and device
CN115186738B (en) Model training method, device and storage medium
CN113946648B (en) Structured information generation method and device, electronic equipment and medium
CN111737572B (en) Search statement generation method and device and electronic equipment
CN110895655A (en) Method and device for extracting text core phrase
CN113807056A (en) Method, device and equipment for correcting error of document name sequence number
CN111338621A (en) Data display method and device, electronic equipment and computer readable medium
CN108664535B (en) Information output method and device
CN116737762B (en) Structured query statement generation method, device and computer readable medium
CN111626044A (en) Text generation method and device, electronic equipment and computer readable storage medium
CN114385781B (en) Interface file recommendation method, device, equipment and medium based on statement model
CN113609309B (en) Knowledge graph construction method and device, storage medium and electronic equipment
CN115374320B (en) Text matching method and device, electronic equipment and computer medium
CN117172220B (en) Text similarity information generation method, device, equipment and computer readable medium
CN116881997B (en) City planning interface display method and device, electronic equipment and medium
CN116303529A (en) Object acquisition method, device, electronic equipment and computer readable medium
CN116340591A (en) Processing method, device, equipment and storage medium of table data
CN115422907A (en) Multi-dimensional science and technology project item establishment duplicate checking method and device and electronic equipment
CN114925684A (en) Variable name generation method and device, electronic equipment and computer readable medium
CN116431657A (en) Data query statement generation method, device, equipment, storage medium and product
CN115481260A (en) Knowledge graph construction method and device based on audit information and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination