WO2022142593A1 - Text classification method and apparatus, electronic device, and readable storage medium - Google Patents

Text classification method and apparatus, electronic device, and readable storage medium

Info

Publication number
WO2022142593A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
keyword
text
word
target
Prior art date
Application number
PCT/CN2021/123898
Other languages
English (en)
Chinese (zh)
Inventor
蒋宏达
徐国强
Original Assignee
深圳壹账通智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022142593A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/16 Real estate

Definitions

  • the present application relates to the field of artificial intelligence, and in particular, to a text classification method, apparatus, electronic device, and readable storage medium.
  • the inventor realized that real estate texts are currently classified by using the traditional Text-RNN model according to the overall information of the real estate texts, but this method lacks the extraction of local information about keywords in the real estate texts, resulting in low classification accuracy for real estate texts, which is not conducive to the evaluation of real estate by enterprises.
  • a text classification method comprising:
  • a text classification device includes:
  • the receiving module is used for receiving the original text, cleaning the original text to obtain the target text, and extracting the semantic information of the target text to obtain the text semantic information;
  • the extraction module is used to perform word segmentation on the target text to obtain a word segmentation set, extract a keyword set from the word segmentation set, and obtain the part-of-speech information set of the keyword set and the location information set of the keyword set in the target text;
  • the processing module is used to convert the keyword set, the position information set and the part-of-speech information set into a keyword vector set, a position information vector set and a part-of-speech information vector set by using the preset vector coding mapping table, and to perform vector splicing on the keyword vector set, the position information vector set and the part-of-speech information vector set to obtain the target word vector set;
  • the recognition module is used to recognize the semantic information of each target word vector in the target word vector set by using the pre-trained semantic recognition model to obtain a word semantic information set, and to identify the text category of the target text according to the text semantic information of the target text and the word semantic information set.
  • An electronic device comprising:
  • at least one processor; and a memory communicatively connected to the at least one processor; wherein,
  • the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to implement the steps of the above text classification method.
  • a readable storage medium comprising a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program; when the computer program is executed by a processor, the following steps are implemented:
  • the present application can improve the accuracy of text classification.
  • FIG. 1 is a schematic flowchart of a text classification method provided by an embodiment of the present application.
  • FIG. 3 is a detailed schematic flowchart of S6 in a text classification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic block diagram of a text classification apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an internal structure of an electronic device implementing a text classification method provided by an embodiment of the present application
  • the embodiment of the present application provides a text classification method.
  • the execution body of the text classification method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal.
  • the text classification method can be executed by software or hardware installed in terminal equipment or server equipment, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the embodiments of the present application may acquire and process related data based on artificial intelligence technology.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the text classification method includes:
  • the original text may be a set composed of multiple sentences, paragraphs or chapters.
  • the original text may be a document recording enterprise real estate information extracted from an enterprise asset information database; for example, the original text includes real estate general contracting contracts, real estate subcontracts, real estate construction contracts, real estate registration certificates, and the like.
  • the cleaning refers to filtering the original text for punctuation and special symbols.
  • the original text can be cleaned by using a regular expression in Unicode punctuation-property mode, that is, the symbols in the original text are matched by the regular expression in Unicode punctuation-property mode, and the symbols that are successfully matched by the regular expression are filtered out to obtain the target text.
  • the original text may be the contract number: R3JG201733, the contracting party: Qingdao West Coast Rail Transit Co., Ltd., all accounts receivable with a contract amount of 40,135,066 yuan;
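  • As an illustration of this cleaning step, the following minimal Python sketch uses the third-party regex package, which supports Unicode property classes; the pattern, function name and sample string are assumptions, not taken from the application:

```python
import regex  # third-party package; supports Unicode property classes such as \p{P}

def clean_text(original_text: str) -> str:
    """Filter Unicode punctuation and symbol characters from the original text."""
    return regex.sub(r"[\p{P}\p{S}]+", "", original_text)

original = "Contract number: R3JG201733, contracting party: Qingdao West Coast Rail Transit Co., Ltd."
target_text = clean_text(original)  # punctuation stripped; words and digits are kept
```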
  • the target text may be stored in a blockchain node.
  • text semantic information can be extracted from the target text by using the currently disclosed Text-RNN model.
  • the word segmentation can use a 'stuttering' (jieba) word segmentation program based on programming languages such as Python and Java; for example, the target text 'all accounts receivable with a contract amount of 40,135,066 yuan' may be segmented into [contract][amount][40135066][yuan][of][all][should][received][accounts].
  • the word frequency of each word segment can be obtained by calculating its frequency in the word segmentation set, and the word segments whose word frequency exceeds a preset threshold are used as keywords, wherein the preset threshold may be 0.07; it is not limited to the value listed in this embodiment and can be set according to actual needs in other embodiments of the present application.
  • where f represents the word frequency, k represents the target text, n represents the number of occurrences of the word segment, and m represents the number of word segments in the word segmentation set.
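  • The frequency formula itself is not reproduced in this text; given the variable definitions above, it is presumably the relative frequency of a word segment counted over the target text k (an assumption, stated only to make the definitions concrete):

$$f = \frac{n}{m}$$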
  • the keywords can be future, all, all, divide, and so on.
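  • A minimal sketch of this frequency-based keyword extraction, assuming the jieba ('stuttering') segmenter and the 0.07 threshold mentioned above; the function name and sample sentence are illustrative:

```python
from collections import Counter
import jieba  # "stuttering" (jieba) word segmentation package for Python

def extract_keywords(target_text: str, threshold: float = 0.07) -> list[str]:
    """Segment the target text and keep segments whose relative frequency exceeds the threshold."""
    segments = list(jieba.cut(target_text))  # the word segmentation set
    counts = Counter(segments)
    m = len(segments)                        # total number of word segments
    # word frequency f = n / m for each segment; keep those above the preset threshold
    return [segment for segment, n in counts.items() if n / m > threshold]

keywords = extract_keywords("合同金额40135066元的全部应收账款")
```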
  • the position information set may be obtained by using a preset word position encoder to perform position encoding on the keyword set. For example, after word segmentation of the target text, the obtained word segmentation set contains 300 word segments; position encoding is performed on each word segment according to the preset word position encoder, and position codes 0-299 corresponding to the word segments are obtained; among these position codes, the position code corresponding to each keyword in the keyword set is queried to obtain the position information of each keyword in the target text.
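  • The position lookup described above can be sketched as follows, assuming a plain 0-based index per segment serves as the position code; names are illustrative:

```python
def keyword_positions(segments: list[str], keywords: list[str]) -> dict[str, list[int]]:
    """Look up, for each keyword, the position codes (0-based indices) assigned to the segments."""
    # A 300-segment text yields position codes 0-299, one per segment.
    keyword_set = set(keywords)
    positions: dict[str, list[int]] = {keyword: [] for keyword in keywords}
    for code, segment in enumerate(segments):
        if segment in keyword_set:
            positions[segment].append(code)
    return positions

print(keyword_positions(["contract", "amount", "all", "accounts"], ["all"]))  # {'all': [2]}
```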
  • the part-of-speech information set may be obtained by using an HMM-based part-of-speech tagging method in the Python programming language to perform part-of-speech tagging on the keyword set.
  • for example, the keyword can be "all",
  • and the part-of-speech information of "all" obtained after processing with the HMM-based part-of-speech tagging method includes adjective, pronoun, and the like.
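  • A sketch of HMM-based part-of-speech tagging in Python; jieba's posseg module, which applies a hidden Markov model to out-of-vocabulary words, is assumed here as one possible implementation:

```python
import jieba.posseg as pseg  # jieba's POS tagger; applies an HMM to out-of-vocabulary words

def keyword_pos_tags(keywords: list[str]) -> dict[str, list[str]]:
    """Return the part-of-speech tag(s) produced for each keyword."""
    tags: dict[str, list[str]] = {}
    for keyword in keywords:
        # pseg.cut yields (word, flag) pairs; HMM=True enables the hidden Markov model
        tags[keyword] = [pair.flag for pair in pseg.cut(keyword, HMM=True)]
    return tags

print(keyword_pos_tags(["全部"]))  # the exact tag depends on jieba's dictionary and model
```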
  • the vector coding mapping table is a matrix mapping method: the keyword set, the position information set and the part-of-speech information set are encoded to obtain their respective vector codes, the corresponding vector representations are queried in the mapping table according to the vector codes, and the vector codes are correspondingly converted to obtain the keyword vector set, the position information vector set and the part-of-speech information vector set.
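  • One way to realize such a vector coding mapping table is a matrix whose rows are indexed by the integer vector codes; the NumPy sketch below assumes the table sizes and dimensions, which are not specified in the application:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mapping tables: each vector code (row index) maps to a vector representation.
keyword_table = rng.normal(size=(5000, 128))   # keyword codes        -> 128-dimensional vectors
position_table = rng.normal(size=(512, 32))    # position codes       -> 32-dimensional vectors
pos_tag_table = rng.normal(size=(64, 32))      # part-of-speech codes -> 32-dimensional vectors

def encode(codes: list[int], table: np.ndarray) -> np.ndarray:
    """Convert a list of integer codes into their vector representations by table lookup."""
    return table[np.asarray(codes)]

keyword_vectors = encode([17, 42], keyword_table)    # keyword vector set
position_vectors = encode([3, 27], position_table)   # position information vector set
pos_tag_vectors = encode([5, 9], pos_tag_table)      # part-of-speech information vector set
```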
  • before step S4, the method further includes: identifying whether the number of characters in each keyword vector in the keyword vector set exceeds a preset number; if the number of characters in the keyword vector does not exceed the preset number, the keyword vector is used as the keyword vector for vector splicing;
  • if the number of characters in the keyword vector exceeds the preset number, the keyword vector and each word (character) vector in the keyword vector are combined and used as the keyword vector for vector splicing.
  • the preset number in this embodiment of the present application may be 1.
  • the embodiment of the present application combines the keyword vector with each word (character) vector of the keyword according to the following formula, where w_emb represents the keyword vector used for vector splicing, word_emb represents the keyword vector, N represents the number of characters, and char_emb represents the word (character) vector.
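  • The combination formula is not reproduced in this text; one plausible reading of the variable definitions, sketched below purely as an assumption, is that the keyword vector is added to the average of its N character vectors:

```python
import numpy as np

def combine_keyword_and_chars(word_emb: np.ndarray, char_embs: np.ndarray) -> np.ndarray:
    """Assumed combination: keyword vector plus the mean of its N character (word) vectors."""
    n_chars = char_embs.shape[0]                        # N, the number of characters in the keyword
    return word_emb + char_embs.sum(axis=0) / n_chars   # w_emb used for vector splicing

w_emb = combine_keyword_and_chars(np.zeros(128), np.ones((3, 128)))  # a 3-character keyword
```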
  • vector splicing is performed on the keyword vector set, the position information vector set and the part-of-speech information vector set as follows to obtain the target word vector set tar_emb, where tar_emb represents the target word vector set, word_emb represents the keyword vector set, pos_emb represents the position information vector set, and loc_emb represents the part-of-speech information vector set.
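  • Vector splicing is most naturally read as concatenation of the three vectors of each keyword; a sketch with illustrative dimensions:

```python
import numpy as np

def splice_vectors(word_emb: np.ndarray, pos_emb: np.ndarray, loc_emb: np.ndarray) -> np.ndarray:
    """Concatenate keyword, position-information and part-of-speech vectors per keyword."""
    # Shapes: (num_keywords, d_word), (num_keywords, d_pos), (num_keywords, d_loc)
    return np.concatenate([word_emb, pos_emb, loc_emb], axis=-1)

tar_emb = splice_vectors(np.zeros((2, 128)), np.zeros((2, 32)), np.zeros((2, 32)))
print(tar_emb.shape)  # (2, 192): the target word vector set
```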
  • the semantic recognition model includes a long short-term memory neural network and a feature activation function.
  • the semantic information of each target word vector in the target word vector set is identified by using the pre-trained semantic recognition model to obtain a word semantic information set, including:
  • the embodiment of the present application uses the following feature activation function to perform activation calculation on the word semantic feature set:
  • where F(x_n) represents the word semantic information of the n-th target word vector, and x_n represents the n-th word semantic feature in the word semantic feature set.
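  • A minimal sketch of such a semantic recognition model, assuming a PyTorch LSTM followed by a softmax-style activation; the application does not give the exact activation function F, so softmax and all dimensions here are assumptions:

```python
import torch
import torch.nn as nn

class SemanticRecognitionModel(nn.Module):
    """Long short-term memory network plus a feature activation function (softmax assumed)."""

    def __init__(self, input_dim: int = 192, hidden_dim: int = 128, num_classes: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, target_word_vectors: torch.Tensor) -> torch.Tensor:
        # target_word_vectors: (batch, sequence_length, input_dim)
        features, _ = self.lstm(target_word_vectors)   # word semantic feature set x_n
        logits = self.fc(features)
        return torch.softmax(logits, dim=-1)           # F(x_n): word semantic information

word_semantic_info = SemanticRecognitionModel()(torch.zeros(1, 10, 192))
```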
  • training the semantic recognition model includes:
  • the features are activated and calculated to obtain the semantic information of the training set.
  • the loss function of the semantic recognition model is used to calculate the loss value of the training set; when the loss value is not less than a preset threshold, the internal parameters of the semantic recognition model are adjusted until the loss value is less than the preset threshold, and a trained semantic recognition model is obtained.
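  • A sketch of this training procedure, assuming a negative log-likelihood loss on the model's probability outputs and a fixed loss threshold; the optimizer, labels, threshold value and epoch cap are illustrative:

```python
import torch
import torch.nn as nn

def train_semantic_model(model: nn.Module, train_vectors: torch.Tensor,
                         train_labels: torch.Tensor, loss_threshold: float = 0.05,
                         max_epochs: int = 1000) -> nn.Module:
    """Adjust the model's internal parameters until the training-set loss falls below the threshold."""
    criterion = nn.NLLLoss()  # the model outputs probabilities, so feed it log-probabilities
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        optimizer.zero_grad()
        probs = model(train_vectors).mean(dim=1)          # (batch, classes) semantic information
        loss = criterion(torch.log(probs + 1e-9), train_labels)
        loss.backward()
        optimizer.step()
        if loss.item() < loss_threshold:                  # stop once the loss is below the threshold
            break
    return model
```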
  • identifying the text category of the target text according to the text semantic information of the target text and the word semantic information set includes:
  • the text category of the target text only belongs to one category.
  • the category semantic information may include project contracting method, project pricing method, project construction content, etc.
  • the project contracting method may include general contracting method and subcontracting method, and the project pricing method may include fixed
  • the engineering construction content may include unit engineering and sub-item engineering.
  • the text classification method obtains the keywords in the text together with the position information and part-of-speech information of the keywords, obtains the target word vectors by splicing the keywords, the position information and the part-of-speech information into vectors, uses the pre-trained semantic recognition model to obtain the word semantic information of the keywords, and identifies the text category of the text accordingly.
  • the embodiment of the present application identifies the text type by combining the overall semantic information of the text and the local semantic information of the text keywords, thereby improving the accuracy of text classification. Therefore, the text classification method, device and readable storage medium proposed in this application can identify the text type of the text by combining the overall semantic information of the text and the local semantic information of the text keywords, thereby improving the accuracy of text classification.
  • FIG. 4 is a schematic block diagram of the text classification apparatus of the present application.
  • the text classification apparatus 100 described in this application can be installed in an electronic device. According to the implemented functions, the text classification apparatus may include a receiving module 101 , an extracting module 102 , a processing module 103 and an identifying module 104 .
  • the modules described in the present invention can also be called units, which refer to a series of computer program segments that can be executed by the electronic device processor and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the receiving module 101 is configured to receive the original text, clean the original text to obtain the target text, and extract the semantic information of the target text to obtain the text semantic information.
  • the original text may be a set composed of multiple sentences, paragraphs or chapters.
  • the original text may be a document about recording enterprise real estate information extracted from an enterprise asset information database, and the document is used as the original text.
  • the original text includes real estate general contract, real estate subcontract, real estate construction contract and real estate registration certificate.
  • the cleaning refers to filtering the original text for punctuation and special symbols.
  • the original text can be cleaned by using a regular expression in Unicode punctuation-property mode, that is, the symbols in the original text are matched by the regular expression in Unicode punctuation-property mode, and the symbols that are successfully matched by the regular expression are filtered out to obtain the target text.
  • the original text may be the contract number: R3JG201733, the contracting party: Qingdao West Coast Rail Transit Co., Ltd., all accounts receivable with a contract amount of 40,135,066 yuan;
  • text semantic information can be extracted from the target text by using the currently disclosed Text-RNN model.
  • the extraction module 102 is configured to perform word segmentation on the target text to obtain a word segmentation set, extract a keyword set from the word segmentation set, and obtain the part-of-speech information set of the keyword set and the location information set of the keyword set in the target text.
  • the word segmentation can use a 'stuttering' (jieba) word segmentation program based on programming languages such as Python and Java; for example, the target text 'all accounts receivable with a contract amount of 40,135,066 yuan' may be segmented into [contract][amount][40135066][yuan][of][all][should][received][accounts].
  • the word frequency of each word segment can be obtained by calculating its frequency in the word segmentation set, and the word segments whose word frequency exceeds a preset threshold are used as keywords, wherein the preset threshold may be 0.07; it is not limited to the value listed in this embodiment and can be set according to actual needs in other embodiments of the present application.
  • where f represents the word frequency, k represents the target text, n represents the number of occurrences of the word segment, and m represents the number of word segments in the word segmentation set.
  • the keywords can be future, all, all, divide, and so on.
  • the position information set may be obtained by using a preset word position encoder to perform position encoding on the keyword set. For example, after word segmentation of the target text, the obtained word segmentation set contains 300 word segments; position encoding is performed on each word segment according to the preset word position encoder, and position codes 0-299 corresponding to the word segments are obtained; among these position codes, the position code corresponding to each keyword in the keyword set is queried to obtain the position information of each keyword in the target text.
  • the part-of-speech information set may be obtained by using an HMM-based part-of-speech tagging method in the Python programming language to perform part-of-speech tagging on the keyword set.
  • for example, the keyword can be "all",
  • and the part-of-speech information of "all" obtained after processing with the HMM-based part-of-speech tagging method includes adjective, pronoun, and the like.
  • the processing module 103 is configured to convert the keyword set, location information set and part-of-speech information set into a keyword vector set, location information vector set and part-of-speech information vector set using a preset vector coding mapping table, The keyword vector set, the location information vector set and the part of speech information vector set are vector spliced to obtain the target word vector set.
  • the vector coding mapping table is a matrix mapping method: the keyword set, the position information set and the part-of-speech information set are encoded to obtain their respective vector codes, the corresponding vector representations are queried in the mapping table according to the vector codes, and the vector codes are correspondingly converted to obtain the keyword vector set, the position information vector set and the part-of-speech information vector set.
  • the processing module 103 is further configured to: identify whether the number of characters in each keyword vector in the keyword vector set exceeds a preset number; if the number of characters in the keyword vector does not exceed the preset number, the keyword vector is used as the keyword vector for vector splicing; if the number of characters in the keyword vector exceeds the preset number, the keyword vector and each word (character) vector in the keyword vector are combined and used as the keyword vector for vector splicing.
  • the preset number in this embodiment of the present application may be 1.
  • the embodiment of the present application combines the keyword vector with each word (character) vector of the keyword according to the same formula as in the method above, where w_emb represents the keyword vector used for vector splicing, word_emb represents the keyword vector, N represents the number of characters, and char_emb represents the word (character) vector.
  • vector splicing is performed on the keyword vector set, the position information vector set and the part-of-speech information vector set in the same way as above to obtain the target word vector set tar_emb, where tar_emb represents the target word vector set, word_emb represents the keyword vector set, pos_emb represents the position information vector set, and loc_emb represents the part-of-speech information vector set.
  • the recognition module 104 is used to recognize the semantic information of each target word vector in the target word vector set by using the pre-trained semantic recognition model to obtain a word semantic information set, and to identify the text category of the target text according to the text semantic information of the target text and the word semantic information set.
  • the semantic recognition model includes a long short-term memory neural network and a feature activation function.
  • when using the pre-trained semantic recognition model to recognize the semantic information of each target word vector in the target word vector set to obtain the word semantic information set, the recognition module 104 is specifically configured to:
  • the word semantic feature set is activated by using a preset feature activation function to obtain a word semantic information set.
  • the embodiment of the present application uses the following feature activation function to perform activation calculation on the word semantic feature set:
  • where F(x_n) represents the word semantic information of the n-th target word vector, and x_n represents the n-th word semantic feature in the word semantic feature set.
  • before the pre-trained semantic recognition model is used to recognize the semantic information of each target word vector in the target word vector set to obtain the word semantic information set, the recognition module 104 can also be used to train the semantic recognition model.
  • the training of the semantic recognition model includes:
  • the features are activated and calculated to obtain the semantic information of the training set.
  • the loss function of the semantic recognition model is used to calculate the loss value of the training set; when the loss value is not less than a preset threshold, the internal parameters of the semantic recognition model are adjusted until the loss value is less than the preset threshold, and a trained semantic recognition model is obtained.
  • when identifying the text category of the target text according to the text semantic information of the target text and the word semantic information set, the recognition module 104 is specifically configured to:
  • the text category of the target text only belongs to one category.
  • the category semantic information may include project contracting method, project pricing method, project construction content, etc.
  • the project contracting method may include general contracting method and subcontracting method, and the project pricing method may include fixed
  • the engineering construction content may include unit engineering and sub-item engineering.
  • FIG. 5 is a schematic structural diagram of an electronic device implementing the text classification method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program, such as a text classification program 12, stored in the memory 11 and executable on the processor 10.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the text classification program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same function or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control core (Control Unit) of the electronic device; it uses various interfaces and lines to connect the various components of the entire electronic device, and performs various functions of the electronic device 1 and processes data by running or executing programs or modules stored in the memory 11 (for example, executing the text classification program) and calling data stored in the memory 11.
  • the bus may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, or the like.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 5 only shows an electronic device with some components; those skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is generally used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the text classification program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions, and when running in the processor 10, can realize:
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
  • the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store created data, and the like.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the readable storage medium stores a computer program; when the computer program is executed by the processor of the electronic device, it can achieve:
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed is a text classification method, comprising the following steps: cleaning an original text to obtain a target text; segmenting the target text to obtain a word segmentation set; extracting a keyword set and acquiring a location information set and a part-of-speech information set of the keyword set; converting the keyword set, the location information set and the part-of-speech information set into a keyword vector set, a location information vector set and a part-of-speech information vector set, and then performing vector splicing to obtain a target word vector set; identifying semantic information of the target word vector set by means of a semantic recognition model so as to obtain a word semantic information set; and identifying the text category of the target text according to text semantic information of the target text and the word semantic information set. The present invention further relates to blockchain technology, and the target text can be stored in a blockchain node. Also disclosed are a text classification apparatus, an electronic device and a readable storage medium. The present method can improve the accuracy of text classification.
PCT/CN2021/123898 2020-12-28 2021-10-14 Procédé et appareil de classification de texte, dispositif électronique et support de stockage lisible WO2022142593A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011581315.1 2020-12-28
CN202011581315.1A CN112597312A (zh) 2020-12-28 2020-12-28 文本分类方法、装置、电子设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2022142593A1 true WO2022142593A1 (fr) 2022-07-07

Family

ID=75203640

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/123898 WO2022142593A1 (fr) 2020-12-28 2021-10-14 Procédé et appareil de classification de texte, dispositif électronique et support de stockage lisible

Country Status (2)

Country Link
CN (1) CN112597312A (fr)
WO (1) WO2022142593A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169291A (zh) * 2022-07-14 2022-10-11 中国建筑西南设计研究院有限公司 文本转换方法、装置、终端设备和计算机可读存储介质
CN115543925A (zh) * 2022-12-02 2022-12-30 北京德风新征程科技有限公司 文件处理方法、装置、电子设备和计算机可读介质
CN116561652A (zh) * 2023-04-04 2023-08-08 陆泽科技有限公司 一种标签标注方法及装置、电子设备、存储介质
CN116664319A (zh) * 2023-08-01 2023-08-29 北京力码科技有限公司 一种基于大数据的金融保单分类系统
CN117273667A (zh) * 2023-11-22 2023-12-22 浪潮通用软件有限公司 一种单据审核处理方法及设备
WO2024060066A1 (fr) * 2022-09-21 2024-03-28 京东方科技集团股份有限公司 Procédé de reconnaissance de texte, modèle et dispositif électronique

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597312A (zh) * 2020-12-28 2021-04-02 深圳壹账通智能科技有限公司 文本分类方法、装置、电子设备及可读存储介质
CN113515591B (zh) * 2021-04-22 2024-03-15 平安科技(深圳)有限公司 文本不良信息识别方法、装置、电子设备及存储介质
CN113239190B (zh) * 2021-04-27 2024-02-20 天九共享网络科技集团有限公司 文档分类方法、装置、存储介质及电子设备
CN113157927B (zh) * 2021-05-27 2023-10-31 中国平安人寿保险股份有限公司 文本分类方法、装置、电子设备及可读存储介质
CN113626605B (zh) * 2021-08-31 2023-11-28 中国平安财产保险股份有限公司 信息分类方法、装置、电子设备及可读存储介质
CN114048736A (zh) * 2021-10-21 2022-02-15 盐城金堤科技有限公司 执行主体的提取方法、装置、存储介质和电子设备
CN114118041A (zh) * 2021-11-01 2022-03-01 深圳前海微众银行股份有限公司 一种文本生成方法及装置、存储介质
CN114943306A (zh) * 2022-06-24 2022-08-26 平安普惠企业管理有限公司 意图分类方法、装置、设备及存储介质
CN114996463B (zh) * 2022-07-18 2022-11-01 武汉大学人民医院(湖北省人民医院) 一种病例的智能分类方法和装置
CN116189193B (zh) * 2023-04-25 2023-11-10 杭州镭湖科技有限公司 一种基于样本信息的数据存储可视化方法和装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145097A (zh) * 2018-06-11 2019-01-04 人民法院信息技术服务中心 一种基于信息提取的裁判文书分类方法
CN111126053A (zh) * 2018-10-31 2020-05-08 北京国双科技有限公司 一种信息处理方法及相关设备
CN111680168A (zh) * 2020-05-29 2020-09-18 平安银行股份有限公司 文本特征语义提取方法、装置、电子设备及存储介质
CN111881291A (zh) * 2020-06-19 2020-11-03 山东师范大学 一种文本情感分类方法及系统
CN111930940A (zh) * 2020-07-30 2020-11-13 腾讯科技(深圳)有限公司 一种文本情感分类方法、装置、电子设备及存储介质
CN112597312A (zh) * 2020-12-28 2021-04-02 深圳壹账通智能科技有限公司 文本分类方法、装置、电子设备及可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145097A (zh) * 2018-06-11 2019-01-04 人民法院信息技术服务中心 一种基于信息提取的裁判文书分类方法
CN111126053A (zh) * 2018-10-31 2020-05-08 北京国双科技有限公司 一种信息处理方法及相关设备
CN111680168A (zh) * 2020-05-29 2020-09-18 平安银行股份有限公司 文本特征语义提取方法、装置、电子设备及存储介质
CN111881291A (zh) * 2020-06-19 2020-11-03 山东师范大学 一种文本情感分类方法及系统
CN111930940A (zh) * 2020-07-30 2020-11-13 腾讯科技(深圳)有限公司 一种文本情感分类方法、装置、电子设备及存储介质
CN112597312A (zh) * 2020-12-28 2021-04-02 深圳壹账通智能科技有限公司 文本分类方法、装置、电子设备及可读存储介质

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115169291A (zh) * 2022-07-14 2022-10-11 中国建筑西南设计研究院有限公司 文本转换方法、装置、终端设备和计算机可读存储介质
CN115169291B (zh) * 2022-07-14 2023-05-12 中国建筑西南设计研究院有限公司 文本转换方法、装置、终端设备和计算机可读存储介质
WO2024060066A1 (fr) * 2022-09-21 2024-03-28 京东方科技集团股份有限公司 Procédé de reconnaissance de texte, modèle et dispositif électronique
CN115543925A (zh) * 2022-12-02 2022-12-30 北京德风新征程科技有限公司 文件处理方法、装置、电子设备和计算机可读介质
CN116561652A (zh) * 2023-04-04 2023-08-08 陆泽科技有限公司 一种标签标注方法及装置、电子设备、存储介质
CN116561652B (zh) * 2023-04-04 2024-04-26 陆泽科技有限公司 一种标签标注方法及装置、电子设备、存储介质
CN116664319A (zh) * 2023-08-01 2023-08-29 北京力码科技有限公司 一种基于大数据的金融保单分类系统
CN117273667A (zh) * 2023-11-22 2023-12-22 浪潮通用软件有限公司 一种单据审核处理方法及设备
CN117273667B (zh) * 2023-11-22 2024-02-20 浪潮通用软件有限公司 一种单据审核处理方法及设备

Also Published As

Publication number Publication date
CN112597312A (zh) 2021-04-02

Similar Documents

Publication Publication Date Title
WO2022142593A1 (fr) Procédé et appareil de classification de texte, dispositif électronique et support de stockage lisible
WO2022134759A1 (fr) Procédé et appareil de génération de mots-clés et dispositif électronique et support de stockage informatique
WO2022121171A1 (fr) Procédé et appareil de mise en correspondance de textes similaires, ainsi que dispositif électronique et support de stockage informatique
WO2021208696A1 (fr) Procédé d'analyse d'intention d'utilisateur, appareil, dispositif électronique et support de stockage informatique
US11216618B2 (en) Query processing method, apparatus, server and storage medium
WO2022116435A1 (fr) Procédé et appareil de génération de titre, dispositif électronique et support de stockage
CN112507704B (zh) 多意图识别方法、装置、设备及存储介质
US20120158742A1 (en) Managing documents using weighted prevalence data for statements
CN113704429A (zh) 基于半监督学习的意图识别方法、装置、设备及介质
CN113722483A (zh) 话题分类方法、装置、设备及存储介质
CN112951233A (zh) 语音问答方法、装置、电子设备及可读存储介质
CN113205814B (zh) 语音数据标注方法、装置、电子设备及存储介质
CN115221276A (zh) 基于clip的中文图文检索模型训练方法、装置、设备及介质
CN113360654B (zh) 文本分类方法、装置、电子设备及可读存储介质
CN113344125B (zh) 长文本匹配识别方法、装置、电子设备及存储介质
CN112579781B (zh) 文本归类方法、装置、电子设备及介质
CN112015866A (zh) 用于生成同义文本的方法、装置、电子设备及存储介质
CN116468025A (zh) 电子病历结构化方法、装置、电子设备及存储介质
CN116978028A (zh) 视频处理方法、装置、电子设备及存储介质
WO2022141867A1 (fr) Procédé et appareil de reconnaissance de parole, dispositif électronique et support de stockage lisible
CN115525761A (zh) 一种文章关键词筛选类别的方法、装置、设备及存储介质
CN115409041A (zh) 一种非结构化数据提取方法、装置、设备及存储介质
WO2022141838A1 (fr) Procédé et appareil d'analyse de confiance de modèle, dispositif électronique et support de stockage informatique
US11461411B2 (en) System and method for parsing visual information to extract data elements from randomly formatted digital documents
CN114548114A (zh) 文本情绪识别方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913338

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 021023)

122 Ep: pct application non-entry in european phase

Ref document number: 21913338

Country of ref document: EP

Kind code of ref document: A1