CN114254631A - Natural language analysis method and system based on data stream - Google Patents

Publication number
CN114254631A
Authority
CN
China
Prior art keywords
word
sentence
model
sentences
data stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111461882.8A
Other languages
Chinese (zh)
Inventor
苏长君
曾祥禄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhimei Internet Technology Co ltd
Original Assignee
Beijing Zhimei Internet Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhimei Internet Technology Co ltd
Priority to CN202111461882.8A
Publication of CN114254631A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a data-stream-oriented natural language analysis method and system. The data stream is processed into a form better suited to natural language analysis: it is vectorized and arranged into a tree-structured vector sequence; the vector sequence is input into a syntactic model for sentence segmentation to obtain first word components; the first word components are input one by one into a semantic analysis model to obtain second word components; and a new sentence is formed according to a preset mapping between phrase types and weight values, so that the meaning of the new sentence can be identified.

Description

Natural language analysis method and system based on data stream
Technical Field
The present application relates to the field of network multimedia, and in particular to a data-stream-oriented natural language analysis method and system.
Background
When faced with massive data streams, existing natural language analysis algorithms consume considerable energy and run slowly, and therefore need improvement. How to process the data stream appropriately is a problem for those skilled in the art to consider.
Therefore, a natural language analysis method and system designed specifically for data streams is needed.
Disclosure of Invention
The invention aims to provide a data-stream-oriented natural language analysis method and system that process the data stream into a form better suited to natural language analysis: the data stream is vectorized and arranged into a tree-structured vector sequence; the vector sequence is input into a syntactic model for sentence segmentation to obtain first word components; the first word components are input one by one into a semantic analysis model to obtain second word components; and a new sentence is formed according to a preset mapping between phrase types and weight values, so that the meaning of the new sentence can be identified.
In a first aspect, the present application provides a data-stream-oriented natural language analysis method, the method including:
acquiring a network data stream and extracting from it the carried sentences and additional element information, where the additional element information refers to identifiers used to distinguish different sentences and different sources; mapping the sentences and the additional element information each into data with string-type attributes, and vectorizing that data to obtain a first vector sequence;
assigning the first vector sequences to a tree structure in end-to-end order, where the vector sequences corresponding to the additional element information are placed at the subtree leaves of the vector sequences corresponding to sentences from the same source, thereby obtaining a tree-structured second vector sequence;
inputting the second vector sequence into a syntactic model and performing preliminary sentence segmentation to obtain first word components, where the syntactic model sets extraction windows of different widths for each word type, the extraction windows serve as the basis for segmentation, and the words within a window's width form a first word component;
inputting the first word components into a semantic analysis model one by one; if a first word component can be recognized as a short sentence, determining that its preliminary segmentation was unsuccessful, inputting it into the syntactic model again, and segmenting it again to obtain second word components; if it cannot be recognized as a short sentence and cannot be recognized as a phrase, determining that its preliminary segmentation was successful and marking it directly as a second word component; here a phrase consists of a plurality of words and has no syntactic structure;
repeatedly inputting the second word components into the semantic analysis model one by one until the preliminary segmentation of every second word component is determined to be successful;
and analyzing all the second word components obtained after segmentation according to a preset mapping between phrase types and weight values, clustering the second word components whose weight values are greater than a threshold to form a new sentence, and identifying the meaning of the new sentence.
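The first two steps above (vectorizing sentences and their identifiers, then arranging the vectors in a tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the `Node` class, the `vectorize` function, the record field names, and the code-point encoding are all hypothetical, since the text does not specify a vectorization scheme or a tree layout beyond placing identifier vectors at the leaves beneath their sentence.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A tree node holding one vector; children hold identifier vectors."""
    label: str
    vector: List[int]
    children: List["Node"] = field(default_factory=list)


def vectorize(text: str) -> List[int]:
    # Stand-in vectorization (an assumption): Unicode code points of the
    # string form. The source does not specify the encoding.
    return [ord(ch) for ch in text]


def build_tree(records: List[Dict[str, str]]) -> Node:
    """Attach sentence vectors to a root in arrival order and hang each
    sentence's identifier vectors beneath it as leaves."""
    root = Node(label="stream", vector=[])
    for rec in records:
        # Map every field to a string before vectorizing it.
        sentence = Node(label="sentence", vector=vectorize(str(rec["sentence"])))
        for key in ("sentence_id", "source_id"):
            sentence.children.append(Node(label=key, vector=vectorize(str(rec[key]))))
        root.children.append(sentence)
    return root


# Hypothetical records extracted from a network data stream.
records = [
    {"sentence": "hello world", "sentence_id": "s1", "source_id": "feedA"},
    {"sentence": "good morning", "sentence_id": "s2", "source_id": "feedB"},
]
tree = build_tree(records)
```

Keeping the identifiers as leaves means a downstream model can walk the sentence nodes in order and still recover which source each sentence came from.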
With reference to the first aspect, in a first possible implementation of the first aspect, setting extraction windows of different widths for each word type includes updating the word types and establishing a correspondence between each new word type and an extraction window width.
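One way to picture the word-type-to-window-width correspondence is a lookup table that can be extended at run time. The sketch below is an assumption-laden illustration: the word types, the widths, the default width, and the rule that the word at the start of a window selects the width are all hypothetical, and a rule-based function stands in for the syntactic model.

```python
from typing import Dict, List, Tuple

# Hypothetical word types and widths; the source gives no concrete values.
WINDOW_WIDTHS: Dict[str, int] = {"noun": 2, "verb": 3}
DEFAULT_WIDTH = 1  # assumed fallback for word types with no registered width


def register_word_type(word_type: str, width: int) -> None:
    """Add or update a word type and its extraction-window width."""
    if width < 1:
        raise ValueError("window width must be at least 1")
    WINDOW_WIDTHS[word_type] = width


def segment(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Split (word, word_type) pairs into components. The type of the word
    at each window start decides how many words the window covers."""
    components: List[List[str]] = []
    i = 0
    while i < len(tagged):
        width = WINDOW_WIDTHS.get(tagged[i][1], DEFAULT_WIDTH)
        components.append([word for word, _ in tagged[i:i + width]])
        i += width
    return components


tagged = [("data", "noun"), ("stream", "noun"), ("very", "adv"),
          ("fast", "adv"), ("flows", "verb")]
before = segment(tagged)       # "adv" falls back to the default width
register_word_type("adv", 2)   # a new word type gets its own window width
after = segment(tagged)
```

Registering the new type changes how the same input is segmented: the two adverbs are split apart before the update and grouped into one component after it.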
With reference to the first aspect, in a second possible implementation of the first aspect, the semantic analysis model performs semantic analysis according to sentence grammar requirements.
With reference to the first aspect, in a third possible implementation of the first aspect, the cores of the semantic analysis model and the syntactic model both use a neural network model.
In a second aspect, the present application provides a data-stream-oriented natural language analysis system, the system including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to perform, according to instructions in the program code, the method of any one of the four possibilities of the first aspect.
In a third aspect, the present application provides a computer-readable storage medium for storing program code, the program code being used to perform the method of any one of the four possibilities of the first aspect.
The invention provides a data-stream-oriented natural language analysis method and system that process the data stream into a form better suited to natural language analysis: the data stream is vectorized and arranged into a tree-structured vector sequence; the vector sequence is input into a syntactic model for sentence segmentation to obtain first word components; the first word components are input one by one into a semantic analysis model to obtain second word components; and a new sentence is formed according to a preset mapping between phrase types and weight values, so that the meaning of the new sentence can be identified.
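The final step, forming a new sentence from components whose weight exceeds a threshold, might look like the sketch below. The phrase types, weight values, and threshold are hypothetical, and "clustering" is interpreted here in the simplest possible way (keep the qualifying components and join them in their original order); the text does not say which clustering procedure is intended.

```python
from typing import Dict, List, Tuple

# Hypothetical phrase types and weights; the source specifies neither.
PHRASE_WEIGHTS: Dict[str, float] = {
    "subject": 0.9,
    "predicate": 0.8,
    "modifier": 0.3,
    "filler": 0.1,
}


def form_new_sentence(components: List[Tuple[str, str]], threshold: float) -> str:
    """Keep components whose phrase-type weight exceeds the threshold and
    join them, in original order, into a new sentence."""
    kept = [text for text, phrase_type in components
            if PHRASE_WEIGHTS.get(phrase_type, 0.0) > threshold]
    return " ".join(kept)


# (component text, phrase type) pairs standing in for second word components.
components = [
    ("well", "filler"),
    ("the server", "subject"),
    ("quite suddenly", "modifier"),
    ("stopped responding", "predicate"),
]
new_sentence = form_new_sentence(components, threshold=0.5)
```

With a threshold of 0.5 only the subject and predicate survive, so low-weight material is dropped before the meaning of the new sentence is interpreted.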
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. Those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that those skilled in the art can more easily understand the advantages and features of the present invention and so that its scope of protection is more clearly defined.
Fig. 1 is a flowchart of the data-stream-oriented natural language analysis method, which includes:
acquiring a network data stream and extracting from it the carried sentences and additional element information, where the additional element information refers to identifiers used to distinguish different sentences and different sources; mapping the sentences and the additional element information each into data with string-type attributes, and vectorizing that data to obtain a first vector sequence;
assigning the first vector sequences to a tree structure in end-to-end order, where the vector sequences corresponding to the additional element information are placed at the subtree leaves of the vector sequences corresponding to sentences from the same source, thereby obtaining a tree-structured second vector sequence;
inputting the second vector sequence into a syntactic model and performing preliminary sentence segmentation to obtain first word components, where the syntactic model sets extraction windows of different widths for each word type, the extraction windows serve as the basis for segmentation, and the words within a window's width form a first word component;
inputting the first word components into a semantic analysis model one by one; if a first word component can be recognized as a short sentence, determining that its preliminary segmentation was unsuccessful, inputting it into the syntactic model again, and segmenting it again to obtain second word components; if it cannot be recognized as a short sentence and cannot be recognized as a phrase, determining that its preliminary segmentation was successful and marking it directly as a second word component; here a phrase consists of a plurality of words and has no syntactic structure;
repeatedly inputting the second word components into the semantic analysis model one by one until the preliminary segmentation of every second word component is determined to be successful;
and analyzing all the second word components obtained after segmentation according to a preset mapping between phrase types and weight values, clustering the second word components whose weight values are greater than a threshold to form a new sentence, and identifying the meaning of the new sentence.
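The check-and-resegment loop in the middle steps can be sketched as a worklist. Everything here is a stand-in: `is_short_sentence` replaces the semantic analysis model with a toy rule (a component holding both a noun and a verb counts as a short sentence), and `resplit` replaces the second pass through the syntactic model by cutting the component in half. Because the wording of the success condition is ambiguous, this sketch assumes that any component not recognized as a short sentence counts as successfully segmented.

```python
from typing import List, Tuple

Component = List[Tuple[str, str]]  # (word, word_type) pairs


def is_short_sentence(component: Component) -> bool:
    # Stand-in for the semantic analysis model: a component containing both
    # a noun and a verb is treated as a clause that still needs splitting.
    types = {word_type for _, word_type in component}
    return {"noun", "verb"} <= types


def resplit(component: Component) -> List[Component]:
    # Stand-in for re-running the syntactic model: cut the component in half.
    mid = len(component) // 2
    return [component[:mid], component[mid:]]


def refine(first_components: List[Component]) -> List[Component]:
    """Re-split components until none is recognized as a short sentence."""
    pending = list(first_components)
    done: List[Component] = []
    while pending:
        component = pending.pop(0)
        if is_short_sentence(component):
            # Preliminary break failed: split again and re-check the parts,
            # keeping them ahead of the remaining components to preserve order.
            pending = resplit(component) + pending
        else:
            # Preliminary break succeeded: mark as a second word component.
            done.append(component)
    return done


first = [
    [("the", "det"), ("server", "noun"), ("stopped", "verb"), ("responding", "verb")],
    [("last", "adj"), ("night", "noun")],
]
second = refine(first)
words = [[w for w, _ in c] for c in second]
```

The loop terminates because a component recognized as a short sentence has at least two words, so each split yields two strictly smaller, non-empty parts.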
In some preferred embodiments, setting extraction windows of different widths for each word type includes updating the word types and establishing a correspondence between each new word type and an extraction window width.
In some preferred embodiments, the semantic analysis model performs semantic analysis according to sentence grammar requirements.
In some preferred embodiments, the cores of the semantic analysis model and the syntactic model both use a neural network model.
The present application provides a data-stream-oriented natural language analysis system, the system including a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to perform, according to instructions in the program code, the method of any of the embodiments of the first aspect.
The present application provides a computer-readable storage medium for storing program code, the program code being used to perform the method of any of the embodiments of the first aspect.
In a specific implementation, the present invention further provides a computer storage medium that may store a program which, when executed, may perform some or all of the steps in the embodiments of the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software together with the necessary general-purpose hardware platform. On this understanding, the technical solutions in the embodiments of the present invention may be embodied as a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions that enable a computer device (for example a personal computer, a server, or a network device) to execute the method of the embodiments or of parts of the embodiments.
For parts that are the same or similar across the embodiments of this specification, the embodiments may be referred to one another. In particular, because the system and storage medium embodiments are substantially similar to the method embodiments, they are described only briefly; for the relevant details, refer to the description of the method embodiments.
The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims (6)

1. A data-stream-oriented natural language analysis method, the method comprising:
acquiring a network data stream and extracting from it the carried sentences and additional element information, wherein the additional element information refers to identifiers used to distinguish different sentences and different sources; mapping the sentences and the additional element information each into data with string-type attributes, and vectorizing that data to obtain a first vector sequence;
assigning the first vector sequences to a tree structure in end-to-end order, wherein the vector sequences corresponding to the additional element information are placed at the subtree leaves of the vector sequences corresponding to sentences from the same source, thereby obtaining a tree-structured second vector sequence;
inputting the second vector sequence into a syntactic model and performing preliminary sentence segmentation to obtain first word components, wherein the syntactic model sets extraction windows of different widths for each word type, the extraction windows serve as the basis for segmentation, and the words within a window's width form a first word component;
inputting the first word components into a semantic analysis model one by one; if a first word component can be recognized as a short sentence, determining that its preliminary segmentation was unsuccessful, inputting it into the syntactic model again, and segmenting it again to obtain second word components; if it cannot be recognized as a short sentence and cannot be recognized as a phrase, determining that its preliminary segmentation was successful and marking it directly as a second word component; wherein a phrase consists of a plurality of words and has no syntactic structure;
repeatedly inputting the second word components into the semantic analysis model one by one until the preliminary segmentation of every second word component is determined to be successful;
and analyzing all the second word components obtained after segmentation according to a preset mapping between phrase types and weight values, clustering the second word components whose weight values are greater than a threshold to form a new sentence, and identifying the meaning of the new sentence.
2. The method of claim 1, wherein setting extraction windows of different widths for each word type includes updating the word types and establishing a correspondence between each new word type and an extraction window width.
3. The method according to any one of claims 1-2, wherein the semantic analysis model performs semantic analysis according to sentence grammar requirements.
4. The method according to any one of claims 1-3, wherein the cores of the semantic analysis model and the syntactic model both use a neural network model.
5. A data-stream-oriented natural language analysis system, the system comprising a processor and a memory:
the memory is configured to store program code and transmit the program code to the processor;
the processor is configured to perform, according to instructions in the program code, the method of any one of claims 1-4.
6. A computer-readable storage medium, wherein the computer-readable storage medium is configured to store program code for performing the method of any one of claims 1-4.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111461882.8A CN114254631A (en) 2021-12-02 2021-12-02 Natural language analysis method and system based on data stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111461882.8A CN114254631A (en) 2021-12-02 2021-12-02 Natural language analysis method and system based on data stream

Publications (1)

Publication Number Publication Date
CN114254631A (en) 2022-03-29

Family

ID=80793890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111461882.8A Pending CN114254631A (en) 2021-12-02 2021-12-02 Natural language analysis method and system based on data stream

Country Status (1)

Country Link
CN (1) CN114254631A (en)

Similar Documents

Publication Publication Date Title
CN110321432B (en) Text event information extraction method, electronic device and nonvolatile storage medium
JP6653334B2 (en) Information extraction method and device
CN111310470B (en) Chinese named entity recognition method fusing word and word features
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
US20180365209A1 (en) Artificial intelligence based method and apparatus for segmenting sentence
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN109635288A (en) A kind of resume abstracting method based on deep neural network
CN115328756A (en) Test case generation method, device and equipment
WO2022134779A1 (en) Method, apparatus and device for extracting character action related data, and storage medium
CN116149669B (en) Binary file-based software component analysis method, binary file-based software component analysis device and binary file-based medium
CN108763202A (en) Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN114997164A (en) Text generation method and device
CN111460829A (en) Intention identification method, device and equipment under multi-scene application and storage medium
CN111492364B (en) Data labeling method and device and storage medium
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN112560425A (en) Template generation method and device, electronic equipment and storage medium
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN114254631A (en) Natural language analysis method and system based on data stream
Bladier et al. German and French neural supertagging experiments for LTAG parsing
CN115168544A (en) Information extraction method, electronic device and storage medium
CN114519357B (en) Natural language processing method and system based on machine learning
CN114239592A (en) Intelligent scheduling method and system based on natural language analysis
CN113326691B (en) Data processing method and device, electronic equipment and computer readable medium
JP2014112306A (en) Demand sentence extract device, demand content identification model learning device, method and program
US11875141B2 (en) System and method for training a neural machine translation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 607a, 6 / F, No. 31, Fuchengmenwai street, Xicheng District, Beijing 100037

Applicant after: Beijing Guorui Digital Intelligence Technology Co.,Ltd.

Address before: 607a, 6 / F, No. 31, Fuchengmenwai street, Xicheng District, Beijing 100037

Applicant before: Beijing Zhimei Internet Technology Co.,Ltd.