CN111832308B - Speech recognition text consistency processing method and device - Google Patents


Info

Publication number: CN111832308B (granted from application CN202010694673.7A; earlier published as CN111832308A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: sentence, word, sentences, word embedding, similarity
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Inventors: 缪庆亮, 吴仁守, 朱钦佩, 朱少华
Current and original assignee: Sipic Technology Co Ltd (the listed assignee may be inaccurate)
Application filed by Sipic Technology Co Ltd


Classifications

    • G06F40/295 Named entity recognition (under G06F40/00 Handling natural language data → G06F40/20 Natural language analysis → G06F40/279 Recognition of textual entities → G06F40/289 Phrasal analysis)
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars (under G06F40/205 Parsing)
    • G06F40/30 Semantic analysis
    • G10L15/26 Speech to text systems (under G10L15/00 Speech recognition)
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method and a device for consistency processing of speech recognition text. The method comprises the following steps: identifying the starting position of at least one piece of key information in the speech recognition text; taking a plurality of sentences from the starting position, calculating a second word embedding for each sentence from the first word embeddings of the words or phrases in that sentence, and calculating a third word embedding for the text segment from the second word embeddings; calculating, based at least on the first, second, and third word embeddings, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence; constructing a semantic graph from the similarities and calculating the importance of each sentence from the graph; and obtaining one or more cluster centers with a graph clustering algorithm, calculating for each sentence the sum of its similarity, consistency, importance, and distance decay, and taking the top-n sentences by this sum as the coherent sentence sequence.

Description

Speech recognition text consistency processing method and device
Technical Field
The application belongs to the technical field of speech recognition post-processing, and particularly relates to a method and a device for consistency processing of speech recognition text.
Background
In the related art, automatic speech recognition (ASR) systems make sentence-segmentation errors in their output, which causes problems for text analysis performed after speech transcription, such as quality inspection and meeting summarization. Text analysis systems therefore face issues such as incoherent ASR results.
Current methods for judging whether sentences are coherent mainly include:
Acoustic-feature-based methods: predicting sentence boundaries from pauses or prosody in a person's speech.
Text-feature-based methods: using language-model or sequence-labeling models to predict whether a sentence-end mark follows a word.
Disclosure of Invention
The embodiments of the application provide a method and a device for consistency processing of speech recognition text, intended to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present application provides a method for consistency processing of speech recognition text, including: identifying the starting position of at least one piece of key information in a speech recognition text through a preset classification template or a preset classification model, wherein the preset classification template or model is formed based on preset keywords, and the key information is the content corresponding to the preset keywords; taking a plurality of sentences from the starting position, calculating a second word embedding for each sentence from the first word embeddings of the words or phrases in that sentence, and calculating a third word embedding for the text segment formed by the plurality of sentences from the second word embeddings; calculating, based at least on the first, second, and third word embeddings, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence; constructing a semantic graph based on the similarities and calculating the importance of each sentence from the semantic graph; and obtaining one or more cluster centers with a graph clustering algorithm, calculating for each sentence the sum of its similarity, consistency, importance, and distance decay, and taking the top-n sentences by this sum as the coherent sentence sequence.
In a second aspect, an embodiment of the present application provides a speech recognition text consistency processing apparatus, including: a recognition module configured to identify the starting position of at least one piece of key information in a speech recognition text through a preset classification template or a preset classification model, wherein the preset classification template or model is formed based on preset keywords, and the key information is the content corresponding to the preset keywords; an embedding module configured to take a plurality of sentences from the starting position, calculate a second word embedding for each sentence from the first word embeddings of the words or phrases in that sentence, and calculate a third word embedding for the text segment formed by the plurality of sentences from the second word embeddings; a first calculation module configured to calculate, based at least on the first, second, and third word embeddings, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence; a construction module configured to construct a semantic graph based on the similarities and calculate the importance of each sentence from the semantic graph; and a second calculation module configured to obtain one or more cluster centers with a graph clustering algorithm, calculate for each sentence the sum of its similarity, consistency, importance, and distance decay, and take the top-n sentences by this sum as the coherent sentence sequence.
In a third aspect, there is provided a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the speech recognition text consistency processing method of the first aspect.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of the first aspect.
The method provided by the embodiments of the application identifies the starting position of key information through a preset classification template or classification model, takes a text segment from that starting position, computes word embeddings for each sentence and for the text segment, calculates the semantic similarity of each sentence to the segment, applies a degree of similarity decay based on the distance between sentences, and finally selects N sentences as the final result, thereby determining which sentences in the text segment are coherent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 3 is a flowchart of another method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for processing consistency of speech recognition text according to an embodiment of the present application;
FIG. 5 is a system flow diagram of one embodiment of a scheme for speech recognition text consistency processing in accordance with an embodiment of the present application;
FIG. 6 is a flow chart of a vector representation of sentences and documents of a specific embodiment of a scheme for speech recognition text consistency processing of an embodiment of the present application;
FIG. 7 is a flowchart of a sentence and text segment similarity output for one embodiment of a scheme for speech recognition text consistency processing in accordance with an embodiment of the present application;
FIG. 8 is a block diagram of a speech recognition text consistency processing device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Referring to fig. 1, a flowchart of an embodiment of the speech recognition text consistency processing method of the present application is shown.
As shown in fig. 1, in step 101, the starting position of at least one piece of key information in a speech recognition text is identified through a preset classification template or a preset classification model, wherein the preset classification template or model is formed based on preset keywords and the key information is the content corresponding to the preset keywords;
in step 102, a plurality of sentences are taken from the starting position, the second word embedding of each sentence is calculated from the first word embeddings of the words or phrases in that sentence, and the third word embedding of the text segment formed by the plurality of sentences is calculated from the second word embeddings;
in step 103, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence are calculated based at least on the first word embedding, the second word embedding, and the third word embedding;
in step 104, a semantic graph is constructed based on the similarities, and the importance of each sentence is calculated from the semantic graph;
in step 105, one or more cluster centers are obtained using a graph clustering algorithm, the sum of the similarity, consistency, importance, and distance decay is calculated for each sentence, and the top-n sentences by this sum are taken as the coherent sentence sequence.
In this embodiment, in step 101, the speech recognition text consistency processing apparatus identifies, using a preset classification template or classification model formed from preset keywords, the starting position of at least one piece of key information, i.e. content corresponding to a preset keyword, in the speech recognition text. For example, suppose a classification template with slots for conference topic, conference time, and conference place is preset, and the speech recognition text is: "Today we discuss the problem of project A; the time of the meeting is set to 4 pm; the place is set in the meeting room." The template can then identify the first piece of key information (project A), the second piece (4 pm), and the third piece (the meeting room), together with the starting position of each in the text.
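The template matching of step 101 might be sketched as follows. This is a minimal, hypothetical implementation: the slot names, template strings, and regular expressions below are illustrative assumptions, not the patent's actual templates.

```python
import re

# Hypothetical templates mapping a slot to a pattern whose capture group
# is the key information; positions are character offsets into the text.
TEMPLATES = {
    "topic": re.compile(r"today we discuss (.+?) problem"),
    "time":  re.compile(r"time of (?:the )?meeting is set to (.+?)(?:,|$)"),
    "place": re.compile(r"place is set in (.+?)(?:,|;|$)"),
}

def find_key_positions(text):
    """Return {slot: start_index} for each template that matches the text."""
    positions = {}
    for slot, pattern in TEMPLATES.items():
        m = pattern.search(text)
        if m:
            positions[slot] = m.start(1)  # start of the captured key content
    return positions
```

A classification model trained on such templates and keywords could replace the regular expressions while keeping the same interface.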
Then, in step 102, the apparatus takes a plurality of sentences starting from the starting position, calculates the second word embedding of each sentence from the first word embeddings of the words or phrases in that sentence, and calculates the third word embedding of the text segment formed by those sentences from the second word embeddings. Word embedding here means converting each word or phrase in a sentence into a vector representation. For example, for the text segment "Today we discuss the problem of project A; the time of the meeting is set to 4 pm; the place is set in the meeting room", vector representations are first obtained for "today", "we", "discuss", "project A", "problem", "meeting", "time", "set", "4 pm", "place", and "meeting room". From these, a vector is computed for each sentence (e.g. the vector of "Today we discuss the problem of project A" from the vectors of its words), and finally the vector of the whole text segment is computed from the sentence vectors.
Then, in step 103, the apparatus calculates, from the first, second, and third word embeddings computed above, the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence.
Then, in step 104, the apparatus constructs a semantic graph based on the similarities and calculates the importance of each sentence from the graph. The semantic graph model is a research perspective on language typology that has attracted attention in recent years; it aims to characterize the multifunctionality of grammatical forms with geometric figures and to reveal the systematicity and regularity of multifunctional patterns of grammatical forms in human language.
Finally, in step 105, one or more cluster centers are obtained using a graph clustering algorithm, the sum of the similarity, consistency, importance, and distance decay is calculated for each sentence, and the top-n sentences by this sum are taken as the coherent sentence sequence. Cluster analysis is a common machine learning technique whose purpose is to divide data points into several classes, such that data in the same class have high similarity while the similarity between different classes is low.
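Assuming the four per-sentence scores have already been computed, the final ranking of step 105 might look like the sketch below; the graph-clustering step that produces cluster centers is omitted here, and the score values used in testing are arbitrary.

```python
def select_coherent_sentences(scores, n):
    """scores: list of (sentence_index, similarity, consistency,
    importance, distance_decay) tuples. Returns the indices of the
    top-n sentences ranked by the sum of the four components."""
    ranked = sorted(scores, key=lambda t: sum(t[1:]), reverse=True)
    return [idx for idx, *_ in ranked[:n]]
```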
In the scheme of this embodiment, the starting position of the key information is identified through a preset classification template or classification model, a text segment is then taken from that starting position, word embeddings are computed for each sentence and for the segment, the semantic similarity of each sentence to the segment is calculated, a degree of similarity decay is applied based on the distance between sentences, and finally N sentences are selected as the final result, thereby determining which sentences in the text segment are coherent.
Referring to fig. 2, a flowchart of another method for consistency processing of speech recognition text according to an embodiment of the present application is shown; this flowchart mainly further defines step 104 of fig. 1, "constructing a semantic graph based on the similarity and calculating the importance of each sentence according to the semantic graph".
As shown in fig. 2, in step 201, each sentence is taken as a node of the semantic graph, and edges between the nodes represent the similarity between each sentence and other sentences;
in step 202, the importance of each sentence is calculated using the TextRank algorithm based on the similarities.
In this embodiment, in step 201, the speech recognition text consistency processing apparatus takes each sentence as a node of the semantic graph, with edges between nodes representing the similarity between sentences; for example, the semantic similarity between the i-th and j-th sentences is S(i, j). An N×N semantic graph is built over the sentences, in which the nodes are sentences and the edges represent semantic relatedness, expressed by S(i, j).
Then, in step 202, the apparatus calculates the importance of each sentence using the TextRank algorithm based on the similarities; for example, the importance of the i-th sentence may be denoted S3(i).
In the scheme of this embodiment, by building a semantic graph of the semantic similarity between sentences, the importance of each sentence can be calculated with the TextRank algorithm.
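The TextRank computation over the semantic graph can be sketched as a pure-Python power iteration on the N×N similarity matrix S(i, j). The damping factor and iteration count below are conventional TextRank defaults, not values given in the patent.

```python
def textrank(sim, d=0.85, iters=50):
    """TextRank over a semantic graph given as an N×N similarity
    matrix (sim[i][j] = S(i, j)); returns one importance score per
    sentence, playing the role of S3(i) in the text."""
    n = len(sim)
    scores = [1.0 / n] * n
    # Total edge weight leaving each node (guard against isolated nodes).
    out_weight = [sum(row) or 1.0 for row in sim]
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(scores[j] * sim[j][i] / out_weight[j]
                       for j in range(n) if j != i)
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores
```

A sentence connected to many similar sentences accumulates a higher score, mirroring how TextRank promotes central sentences.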
In some alternative embodiments, the preset classification template is composed of the preset keywords and a template, and the method further includes: training the preset classification model with the template and the preset keywords, so that the model can identify the key information in the speech recognition text.
Referring to fig. 3, a flowchart of another method for consistency processing of speech recognition text according to an embodiment of the present application is shown; this flowchart mainly further defines step 102 of fig. 1, "taking a plurality of sentences from the starting position, calculating the second word embedding of each sentence from the first word embeddings of the words or phrases in that sentence, and calculating the third word embedding of the text segment formed by the plurality of sentences from the second word embeddings".
As shown in fig. 3, in step 301, a plurality of sentences are taken from the starting position, and the first word embeddings of the words or phrases in each sentence are accumulated to obtain the second word embedding of that sentence;
in step 302, the second word embeddings are accumulated to obtain the third word embedding of the text segment formed by the plurality of sentences.
In this embodiment, in step 301, the speech recognition text consistency processing apparatus takes a plurality of sentences from the starting position to form a text segment and performs word embedding, converting each word or phrase in each sentence into a vector representation (the first word embedding); the vectors of the words or phrases are then accumulated directly to obtain the vector representation of each sentence (the second word embedding).
Finally, in step 302, the vector representations of the sentences are accumulated directly to obtain the vector representation of the text segment formed by the plurality of sentences (the third word embedding).
For example, for the text segment "Today we discuss the problem of project A; the time of the meeting is set to 4 pm; the place is set in the meeting room", the vectors of "today", "we", "discuss", "project A", and "problem" are summed directly to obtain the vector of the sentence "Today we discuss the problem of project A", and the sentence vectors are then summed directly to obtain the vector of the text segment.
In the scheme of this embodiment, the word embeddings of each sentence and of the text segment provide vector representations of each sentence and of the text segment.
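The direct accumulation of steps 301 and 302 can be sketched in a few lines of pure Python. This is a toy illustration: the word vectors used in testing are made up, and a real system would use pretrained embeddings.

```python
def sum_vectors(vectors):
    """Element-wise sum of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) for i in range(dim)]

def sentence_embedding(words, word_vecs):
    """Second word embedding: sum of the first word embeddings of the
    words (or phrases) in one sentence."""
    return sum_vectors([word_vecs[w] for w in words])

def segment_embedding(sentences, word_vecs):
    """Third word embedding: sum of the sentence embeddings of the
    text segment's sentences."""
    return sum_vectors([sentence_embedding(s, word_vecs) for s in sentences])
```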
Referring to fig. 4, a flowchart of another method for consistency processing of speech recognition text according to an embodiment of the present application is shown; this flowchart mainly further defines step 103 of fig. 1, "calculating the similarity between each sentence and the other sentences, the distance decay between each sentence and the starting sentence, and the consistency between each sentence and the starting sentence based at least on the first word embedding, the second word embedding, and the third word embedding".
As shown in fig. 4, in step 401, the similarity between the second word embedding and the third word embedding is calculated by cosine distance;
in step 402, the distance decay from each sentence to the starting sentence is calculated based on a preset distance decay formula;
in step 403, it is determined whether a conjunction links each sentence to the other sentences, or whether each sentence and the other sentences contain a common entity word, so as to calculate the consistency between each sentence and the starting sentence.
In this embodiment, in step 401, the speech recognition text consistency processing apparatus calculates the similarity between the second word embedding and the third word embedding by the cosine distance between them. Then, in step 402, the apparatus calculates the distance decay from each sentence to the starting sentence based on a preset distance decay formula. For example, the distance between each sentence's semantic vector and the overall semantic vector of the text segment is calculated by cosine distance; the larger the score, the better the sentence matches the overall semantics of the segment. Finally, in step 403, the apparatus determines whether each sentence is linked to the other sentences by a conjunction, or whether each sentence and the other sentences contain a common entity word, in order to calculate the consistency between each sentence and the starting sentence; for example, whether there are conjunctions such as "and" or "still" between sentences, or shared named-entity words such as a common personal name.
In the scheme of this embodiment, with the word embeddings of each sentence and of the text segment, the semantic similarity of each sentence to the segment can be calculated by cosine distance together with distance decay, and the consistency of each sentence can be assessed by judging whether it contains a conjunction or a common entity word.
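The conjunction/shared-entity test of step 403 might be sketched as follows. The conjunction list and the entity set are simplified stand-ins for whatever lexicon and named-entity recognizer a real system would use.

```python
# Hypothetical conjunction lexicon; the patent only gives "and" and
# "still" as examples.
CONJUNCTIONS = {"and", "also", "still", "moreover", "therefore"}

def is_coherent(sentence_words, other_words, entities):
    """True if the sentence contains a linking conjunction or shares a
    named-entity word (e.g. a personal name) with the other sentence.
    entities: set of known named-entity words."""
    has_conjunction = bool(set(sentence_words) & CONJUNCTIONS)
    shared_entity = bool(set(sentence_words) & set(other_words) & entities)
    return has_conjunction or shared_entity
```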
In the method of the foregoing embodiment, the preset distance decay formula is:
θ(l) = N0·e^(−λl)
where N0 = 1.0, λ is a preset threshold, and l is the distance from the current sentence to the starting sentence.
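The decay formula translates directly into code; λ and l are as defined above, while the λ value used in testing is an arbitrary example.

```python
import math

def distance_decay(l, lam, n0=1.0):
    """θ(l) = N0 · e^(−λl): decay applied to a sentence at distance l
    from the starting sentence, with preset threshold λ and N0 = 1.0."""
    return n0 * math.exp(-lam * l)
```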
In some alternative embodiments, the method further comprises: in response to an audio input or recording by a user, the audio input or recording is converted into speech recognition text.
To aid understanding of the solution of the present application, the following describes some problems the inventors encountered while implementing the application and one specific embodiment of the finally determined solution.
Drawbacks of these similar techniques:
method based on acoustic features: predicting the whole sentence according to the pause or prosody (prosody) of the person when speaking; because the end of the sentence and the pause time in the voice information are not necessarily related, the speaking speed of each person in each context is different, and the pause time threshold is difficult to set, so that the accuracy of the method is low. Also, sentence end punctuation marks such as periods, question marks, exclamation marks, and the like cannot be distinguished.
Text-feature-based methods: using language-model or sequence-labeling models to predict whether a sentence-end mark follows a word. Models trained this way generalize poorly, and punctuation usage habits differ across contexts, so the final effect of the model is limited.
Why the solution is not obvious:
current methods for judging sentence consistency are often based on acoustic and language models, and common practices include:
method based on acoustic features: i.e. predicting the whole sentence according to the pause or prosody (prosody) of the person speaking; the method for solving the defects of the method is to dynamically adjust the threshold value of the pause interval, and set different threshold values according to acoustic characteristics such as speech speed of each person.
Predicting whether a sentence-end mark follows a word based on language-model or sequence-labeling modeling. The problem with this class of methods is poor generalization, and extending to other domains also requires retraining or tuning the model. A remedy is to use large-scale pre-trained models such as BERT.
The technical problems existing in the prior art are solved through the following scheme:
the scheme provided by the patent not only considers language characteristics, but also considers semantic information among sentences more, and focuses on semantic association among sentences more, and the consistency of text fragments is recognized instead of the consistency inside the sentences. Meanwhile, the method can realize the identification of discontinuous sentences by using a graph ordering algorithm. Such as the sentence ABCDEF, where the ABDF may be the conclusion of the meeting and the CE is the meeting to be handled. The identification of non-consecutive sentences as identical semantically consecutive fragments may be achieved. This is not possible with conventional methods.
Taking as an example summary generation after converting a conference recording to text: first, the starting positions of key information such as the conference topic, conference conclusion, and conference to-dos are identified through rule templates, keywords, or a classification model. Then a text segment of a certain length, containing several sentences, is selected with the first sentence as the starting point. Word embeddings are computed for each sentence and for the text segment, and the semantic similarity of each sentence to the segment is calculated. Combined with the distance information between sentences, a certain semantic-similarity decay is applied, and finally n sentences are selected as the final result. The application combines keyword information, position information, semantic similarity, and other signals, and determines the coherent sentences of the segment through a ranking algorithm.
The technical innovations of this application are:
Important information pre-positioning method based on keywords
Method for calculating semantic similarity between sentences
Sentence sorting and selecting method
The flow of the method is shown in fig. 1. First, the starting position of the key information is identified, using keywords, templates, or a classification model. Some example templates and keywords follow. The keywords and templates can also be used to train a classification model that identifies key information such as topic, time, and place.
Conference theme:
1. Today we discuss the problem of xxxx
2. Let's chat about xxxxx
3. Today we are here to (talk/chat/discuss/explore) the matter of xxxx
Meeting time:
breakfast meeting
Meeting at the afternoon of today
The meeting time is afternoon
Meeting place:
We are holding the meeting at xxx
Our meeting place is xxxx
The meeting place is xxxxxx
Meeting person information:
1. Those at today's meeting are xxx, xxx, xxx and xxx
2. xxxx, xxx, and xxx are holding the meeting together
3. The participants and attendees include xxxx
Meeting to be done:
backlog 1 is/has xxxx
To-do matter is/has xxx
To-do is/has xxx
Meeting to-be-handled responsible person:
xxxx is responsible; xxxx assists and reports to xxxx
Responsible person is xxx
Responsible persons are xxx and xxxx
Conference conclusion:
conference conclusion is xxx
In summary, xxxxx
Overall, xxxxx
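A minimal Python sketch of this keyword/template pre-positioning step. The patterns and sentences below are illustrative assumptions (the patent's actual templates are Chinese-language phrases like those listed above), not part of the disclosure:

```python
import re

# Hypothetical English stand-ins for the patent's templates; each key
# information type maps to a list of regex patterns marking its start.
TEMPLATES = {
    "theme": [r"today we discuss", r"let's chat about"],
    "conclusion": [r"conference conclusion is", r"in summary"],
}

def find_start_positions(sentences):
    """Return {info_type: index of the first sentence matching a template}."""
    starts = {}
    for i, sent in enumerate(sentences):
        for info_type, patterns in TEMPLATES.items():
            if info_type in starts:
                continue  # keep only the first match per type
            if any(re.search(p, sent, re.IGNORECASE) for p in patterns):
                starts[info_type] = i
    return starts
```

In practice the same keyword lists can be used to label training data for a classifier, as the description notes; the regex matcher above is only the template branch of that step.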
And secondly, N sentences are taken starting from the initial position determined in the first step. The Word Embedding of each word or phrase in a sentence is directly accumulated (summed) to obtain the Word Embedding representation of the sentence; the Word Embeddings of the N sentences are in turn directly accumulated to obtain the Word Embedding of the text fragment composed of the N sentences.
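The accumulation described above is plain vector summation. A minimal sketch (function names are illustrative; how the word vectors themselves are produced, e.g. by word2vec, is left outside the patent's description of this step):

```python
import numpy as np

def sentence_embedding(word_embeddings):
    """Second word embedding: sum the word/phrase vectors of one sentence."""
    return np.sum(word_embeddings, axis=0)

def fragment_embedding(sentence_embeddings):
    """Third word embedding: sum the sentence vectors of the N-sentence fragment."""
    return np.sum(sentence_embeddings, axis=0)
```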
The distance between the semantic vector of each sentence and the overall semantic vector of the text fragment is computed as a cosine similarity; the larger the score, the better the sentence matches the overall semantics of the fragment. The semantic similarity score between each sentence and the whole fragment is computed, with S1(i) denoting the semantic similarity of the i-th sentence.
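S1(i) is then just the cosine of the angle between the sentence vector and the fragment vector, sketched as:

```python
import numpy as np

def cosine_similarity(a, b):
    """S1(i): cosine similarity between a sentence vector a and the
    fragment vector b; 1.0 means identical direction, 0.0 orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```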
Third, the distance decay is computed by equation (1), where N0 = 1.0, λ is adjusted as needed, and l is the distance from the current sentence to the starting sentence:

θ(l) = N0 · e^(-λl)    (1)
The attenuated semantic similarity is computed between the n-th sentence and the preceding n-1 sentences, where n takes values from 2 to N.
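A minimal sketch of the decay in equation (1). The patent fixes N0 = 1.0 but leaves λ tunable; the default of 0.1 below is an illustrative assumption:

```python
import math

def distance_decay(l, n0=1.0, lam=0.1):
    """theta(l) = N0 * e^(-lambda * l): attenuation applied to a sentence
    that is l sentences away from the starting sentence."""
    return n0 * math.exp(-lam * l)
```

Larger λ makes the similarity of distant sentences fall off faster, so λ controls how strongly the method prefers sentences close to the starting sentence.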
Step four, continuity is computed from two cues: 1. whether a sentence contains a conjunction such as "and", in which case its continuity is 1.0; 2. whether sentences share a named-entity word (e.g., a company or person name), in which case the continuity is also 1.0. S2(i) denotes the continuity of the i-th sentence with the starting sentence.
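A sketch of the continuity check. The conjunction list and the entity lookup are illustrative stand-ins: the patent targets Chinese conjunctions and presumes a named-entity recognizer, neither of which is specified here:

```python
def continuity_score(sentence, start_sentence,
                     conjunctions=("and", "also", "moreover"),
                     entities=None):
    """S2(i): 1.0 if the sentence begins with a conjunction, or shares a
    named-entity word with the starting sentence; otherwise 0.0.
    `entities` maps a sentence to the entity strings found in it
    (hypothetical interface standing in for an NER component)."""
    words = sentence.lower().split()
    if words and words[0] in conjunctions:
        return 1.0
    if entities:
        shared = set(entities.get(sentence, ())) & set(entities.get(start_sentence, ()))
        if shared:
            return 1.0
    return 0.0
```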
Fifthly, a semantic graph is constructed. Semantic similarity between sentences is computed as in step 2; for example, the similarity between the i-th and j-th sentences is S(i, j). An N×N semantic graph is built among the sentences, in which the nodes are sentences and the edges represent semantic relatedness, weighted by S(i, j). The importance of each node (sentence) is then computed with the TextRank algorithm, with S3(i) denoting the importance of the i-th sentence.
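A compact sketch of TextRank over the N×N similarity matrix, implemented as damped power iteration (the damping factor 0.85 and iteration count are conventional defaults, not values given in the patent):

```python
import numpy as np

def textrank(sim_matrix, d=0.85, iters=50):
    """S3(i): importance of each sentence on the semantic graph, where
    sim_matrix[i, j] = S(i, j). Rows are normalized into transition
    weights, then PageRank-style power iteration is applied."""
    n = sim_matrix.shape[0]
    w = sim_matrix.copy().astype(float)
    np.fill_diagonal(w, 0.0)             # no self-loops
    row_sums = w.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0        # guard isolated nodes
    w = w / row_sums
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - d) / n + d * w.T.dot(scores)
    return scores
```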
And sixthly, one or more clustering centers are found with a graph clustering algorithm; then, for each clustering center, S = S1(i) + S2(i) + S3(i) + θ(l) is computed for each sentence, and the sentences ranked in the top n by S are taken as the coherent sentence sequence.
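The final combination step can be sketched as below. The clustering-center selection is omitted for brevity, and the unweighted sum of the four scores follows the formula above; whether the patent intends any weighting is not stated:

```python
def rank_coherent_sentences(s1, s2, s3, theta, n):
    """Combine S = S1(i) + S2(i) + S3(i) + theta(l_i) per sentence and
    return the indices of the top-n sentences, restored to original
    document order, as the coherent sentence sequence."""
    totals = [a + b + c + t for a, b, c, t in zip(s1, s2, s3, theta)]
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return sorted(order[:n])
```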
Referring to fig. 8, a block diagram of a speech recognition text consistency processing device according to an embodiment of the application is shown.
As shown in fig. 8, the speech recognition text continuity processing device 800 includes: the identification module 810, the embedding module 820, the first calculation module 830, the construction module 840, and the second calculation module 850.
The recognition module 810 is configured to recognize a starting position of at least one piece of key information in the speech recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is the content corresponding to the preset keywords; the embedding module 820 is configured to take a plurality of sentences from the starting position, calculate a second word embedding corresponding to each sentence from a first word embedding of each word or phrase in the sentence, and calculate a third word embedding corresponding to the text segment composed of the plurality of sentences from the second word embeddings; the first calculation module 830 is configured to calculate a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based at least on the first word embedding, the second word embedding, and the third word embedding; the construction module 840 is configured to construct a semantic graph based on the similarity and calculate the importance degree of each sentence from the semantic graph; and the second calculation module 850 is configured to obtain one or more clustering centers by using a graph clustering algorithm, calculate for each clustering center the sum of the similarity, the consistency, the importance, and the distance attenuation, and take the sentences whose sums rank in the top n as a coherent sentence sequence.
It should be understood that the modules depicted in fig. 8 correspond to the various steps in the method described with reference to fig. 1, 2, 3, and 4. Thus, the operations and features described above for the method and the corresponding technical effects are equally applicable to the modules in fig. 8, and are not described here again.
It should be noted that the modules in the embodiments of the present application do not limit the solution of the present application. For example, the recognition module may be described as a module that recognizes the starting position of at least one piece of key information in the speech recognition text through a preset classification template or a preset classification model, where the preset classification template or model is formed based on preset keywords and the key information is the content corresponding to those keywords. In addition, the related functional modules may be implemented by a hardware processor; for example, the recognition module may also be implemented by a processor, which is not described here again.
In other embodiments, embodiments of the present application further provide a non-volatile computer storage medium storing computer executable instructions that are capable of performing the method for processing consistency of speech recognition text in any of the above-described method embodiments;
as one embodiment, the non-volatile computer storage medium of the present application stores computer-executable instructions configured to:
identifying the initial position of at least one piece of key information in a voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
taking a plurality of sentences from the initial position, calculating second word embedding corresponding to each sentence according to first word embedding of each word or each phrase in each sentence, and calculating third word embedding corresponding to a text segment formed by the plurality of sentences according to the second word embedding;
calculating a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based at least on the first word embedding, the second word embedding, and the third word embedding;
constructing a semantic graph based on the similarity, and calculating the importance degree of each sentence according to the semantic graph;
and obtaining one or more clustering centers by using a graph clustering algorithm, calculating, for each clustering center, the sum of the similarity, the consistency, the importance degree, and the distance attenuation, and taking the sentences whose sums rank in the top n as a coherent sentence sequence.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from the use of the speech recognition text continuity processing means, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes a memory remotely located with respect to the processor, the remote memory being connectable to the speech recognition text continuity processing means through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Embodiments of the present application also provide a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform any of the above-described speech recognition text consistency processing methods.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 9, the device includes one or more processors 910 and a memory 920; one processor 910 is taken as an example in fig. 9. The apparatus for the speech recognition text consistency processing method may further include an input device 930 and an output device 940. The processor 910, the memory 920, the input device 930, and the output device 940 may be connected by a bus, as exemplified in fig. 9, or by other means. The memory 920 is the non-volatile computer-readable storage medium described above. The processor 910 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 920, thereby implementing the method of the above-described method embodiments. The input device 930 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the speech recognition text continuity processing device. The output device 940 may include a display device such as a display screen.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
As an embodiment, the electronic device is applied to a speech recognition text consistency processing device, and includes:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to:
identifying the initial position of at least one piece of key information in a voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
taking a plurality of sentences from the initial position, calculating second word embedding corresponding to each sentence according to first word embedding of each word or each phrase in each sentence, and calculating third word embedding corresponding to a text segment formed by the plurality of sentences according to the second word embedding;
calculating a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based at least on the first word embedding, the second word embedding, and the third word embedding;
constructing a semantic graph based on the similarity, and calculating the importance degree of each sentence according to the semantic graph;
and obtaining one or more clustering centers by using a graph clustering algorithm, calculating, for each clustering center, the sum of the similarity, the consistency, the importance degree, and the distance attenuation, and taking the sentences whose sums rank in the top n as a coherent sentence sequence.
The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:
(1) A mobile communication device: such devices are characterized by mobile communication capabilities and are primarily aimed at providing voice, data communications. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer device: such devices are in the category of personal computers, having computing and processing functions, and generally also having mobile internet access characteristics. Such terminals include: PDA, MID, and UMPC devices, etc.
(3) Portable entertainment device: such devices may display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Server: a device that provides computing services with high reliability. Its composition is similar to a general computer architecture, but it has higher requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction function.
The apparatus embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without undue burden.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (8)

1. A method of speech recognition text consistency processing, comprising:
identifying the initial position of at least one piece of key information in a voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
taking a plurality of sentences from the starting position, calculating a second word embedding corresponding to each sentence according to a first word embedding of each word or each phrase in each sentence, and calculating a third word embedding corresponding to a text segment formed by the plurality of sentences according to the second word embedding, wherein the first word embedding is to convert each word or each phrase in each sentence into a vector representation of each word or each phrase, the second word embedding is to directly accumulate the vector representation of each word or each phrase to obtain a vector representation corresponding to each sentence, and the third word embedding is to directly accumulate the vector representation of each sentence to obtain a vector representation corresponding to the text segment formed by the plurality of sentences;
calculating a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a consistency between each sentence and the starting sentence, based on the first word embedding, the second word embedding, and the third word embedding;
calculating the similarity of the second word embedding and the third word embedding through cosine distance;
calculating the distance attenuation from each sentence to the starting sentence based on a preset distance attenuation formula, wherein the preset distance attenuation formula is:

θ(l) = N0 · e^(-λl)

wherein N0 = 1.0, λ is a preset threshold, and l is the distance from the current sentence to the starting sentence;
judging whether each sentence and the other sentences contain continuation words or shared named-entity words, so as to calculate the continuity of each sentence with the starting sentence;
constructing a semantic graph based on the similarity, and calculating the importance degree of each sentence according to the semantic graph;
and obtaining one or more clustering centers by using a graph clustering algorithm, calculating, for each clustering center, the sum of the similarity, the consistency, the importance degree, and the distance attenuation, and taking the sentences whose sums rank in the top n as a coherent sentence sequence.
2. The method of claim 1, wherein the constructing a semantic graph based on the similarity and calculating the importance of each sentence from the semantic graph comprises:
taking each sentence as a node of a semantic graph, and representing the similarity between each sentence and other sentences by edges between the nodes;
and calculating the importance degree of each sentence by using the TextRank algorithm based on the similarity.
3. The method of claim 1, wherein the preset classification template consists of the preset keywords and templates, the method further comprising:
training the preset classification model by using the template and the preset keywords, so that the preset classification model can identify the key information in the voice recognition text.
4. The method of claim 1, wherein the taking a plurality of sentences from the starting position, calculating a second word embedding corresponding to each sentence from a first word embedding of each word or each phrase in each sentence, and calculating a third word embedding corresponding to a text segment composed of the plurality of sentences from the second word embedding comprises:
taking a plurality of sentences from the initial position, and embedding and accumulating the first word of each word or each phrase in each sentence to obtain a second word embedding corresponding to each sentence;
and embedding and accumulating the second words to obtain third word embedding corresponding to the text fragments formed by the sentences.
5. The method of any of claims 1-4, wherein the method further comprises:
in response to an audio input or recording of a user, the audio input or recording is converted into speech recognition text.
6. A speech recognition text continuity processing device comprising:
the recognition module is configured to recognize the starting position of at least one piece of key information in the voice recognition text through a preset classification template or a preset classification model, wherein the preset classification template or the preset classification model is formed based on preset keywords, and the key information is content corresponding to the preset keywords;
an embedding module configured to take a plurality of sentences from the starting position, calculate a second word embedding corresponding to each sentence according to a first word embedding of each word or each phrase in each sentence, calculate a third word embedding corresponding to a text segment composed of the plurality of sentences according to the second word embedding, wherein the first word embedding is to convert each word or each phrase in each sentence into a vector representation of each word or each phrase, the second word embedding is to directly accumulate the vector representation of each word or each phrase to obtain a vector representation corresponding to each sentence, and the third word embedding is to directly accumulate the vector representation of each sentence to obtain a vector representation corresponding to the text segment composed of the plurality of sentences;
a first calculation module configured to calculate a similarity between each sentence and the other sentences, a distance decay between each sentence and the starting sentence, and a continuity between each sentence and the starting sentence based on the first word embedding, the second word embedding, and the third word embedding; the first calculation module further: calculates the similarity of the second word embedding and the third word embedding through cosine distance; calculates the distance attenuation from each sentence to the starting sentence based on a preset distance attenuation formula:

θ(l) = N0 · e^(-λl)

wherein N0 = 1.0, λ is a preset threshold, and l is the distance from the current sentence to the starting sentence; and judges whether each sentence and the other sentences contain continuation words or shared named-entity words, so as to calculate the continuity of each sentence with the starting sentence;
the construction module is configured to construct a semantic graph based on the similarity and calculate the importance degree of each sentence according to the semantic graph;
and a second calculation module configured to acquire one or more clustering centers by using a graph clustering algorithm, calculate, for each clustering center, the sum of the similarity, the consistency, the importance, and the distance attenuation, and take the sentences whose sums rank in the top n as a coherent sentence sequence.
7. A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the method of any of claims 1-5.
8. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 5.
CN202010694673.7A 2020-07-17 2020-07-17 Speech recognition text consistency processing method and device Active CN111832308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010694673.7A CN111832308B (en) 2020-07-17 2020-07-17 Speech recognition text consistency processing method and device


Publications (2)

Publication Number Publication Date
CN111832308A CN111832308A (en) 2020-10-27
CN111832308B true CN111832308B (en) 2023-09-08

Family

ID=72923612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010694673.7A Active CN111832308B (en) 2020-07-17 2020-07-17 Speech recognition text consistency processing method and device

Country Status (1)

Country Link
CN (1) CN111832308B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597278A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Semantic information fusion method and device, electronic equipment and storage medium
CN113011169B (en) * 2021-01-27 2022-11-11 北京字跳网络技术有限公司 Method, device, equipment and medium for processing conference summary
CN113705232B (en) * 2021-03-03 2024-05-07 腾讯科技(深圳)有限公司 Text processing method and device
CN113743125A (en) * 2021-09-07 2021-12-03 广州晓阳智能科技有限公司 Text continuity analysis method and device
CN114611524B (en) * 2022-02-08 2023-11-17 马上消费金融股份有限公司 Text error correction method and device, electronic equipment and storage medium
CN115526173A (en) * 2022-10-12 2022-12-27 湖北大学 Feature word extraction method and system based on computer information technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391942A (en) * 2014-11-25 2015-03-04 中国科学院自动化研究所 Short text characteristic expanding method based on semantic atlas
CN107967257A (en) * 2017-11-20 2018-04-27 哈尔滨工业大学 A kind of tandem type composition generation method
CN108197111A (en) * 2018-01-10 2018-06-22 华南理工大学 A kind of text automatic abstracting method based on fusion Semantic Clustering
CN108549637A (en) * 2018-04-19 2018-09-18 京东方科技集团股份有限公司 Method for recognizing semantics, device based on phonetic and interactive system
CN110287309A (en) * 2019-06-21 2019-09-27 深圳大学 The method of rapidly extracting text snippet
CN110457466A (en) * 2019-06-28 2019-11-15 谭浩 Generate method, computer readable storage medium and the terminal device of interview report


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Zhihong; Guo Yi. Research on automatic keyword extraction from Chinese patents based on word and sentence importance. Information Studies: Theory & Application (情报理论与实践), 2018, Vol. 41, No. 9, pp. 123-129. *

Also Published As

Publication number Publication date
CN111832308A (en) 2020-10-27

Similar Documents

Publication Publication Date Title
CN111832308B (en) Speech recognition text consistency processing method and device
EP3652733B1 (en) Contextual spoken language understanding in a spoken dialogue system
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
US20210142794A1 (en) Speech processing dialog management
CN109509470B (en) Voice interaction method and device, computer readable storage medium and terminal equipment
CN107016994B (en) Voice recognition method and device
CN108255934B (en) Voice control method and device
CN110516253B (en) Chinese spoken language semantic understanding method and system
US11823678B2 (en) Proactive command framework
US11574637B1 (en) Spoken language understanding models
JP7300435B2 (en) Methods, apparatus, electronics, and computer-readable storage media for voice interaction
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
WO2020238045A1 (en) Intelligent speech recognition method and apparatus, and computer-readable storage medium
US11398226B1 (en) Complex natural language processing
US11132994B1 (en) Multi-domain dialog state tracking
CN112017643B (en) Speech recognition model training method, speech recognition method and related device
US11043215B2 (en) Method and system for generating textual representation of user spoken utterance
CN113674742B (en) Man-machine interaction method, device, equipment and storage medium
US11990122B2 (en) User-system dialog expansion
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN110851650A (en) Comment output method and device and computer storage medium
CN113761268A (en) Playing control method, device, equipment and storage medium of audio program content
US11626107B1 (en) Natural language processing
CN112446219A (en) Chinese request text intention analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant