CN111753137A - Video searching method based on voice characteristics - Google Patents

Video searching method based on voice characteristics

Info

Publication number
CN111753137A
CN111753137A
Authority
CN
China
Prior art keywords
voice
neural network
method based
searching method
video searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010604506.9A
Other languages
Chinese (zh)
Other versions
CN111753137B (en)
Inventor
Liang Min (梁敏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN202010604506.9A priority Critical patent/CN111753137B/en
Publication of CN111753137A publication Critical patent/CN111753137A/en
Application granted granted Critical
Publication of CN111753137B publication Critical patent/CN111753137B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video searching method based on voice characteristics, which comprises the following steps: extracting the voice data to be searched into a plurality of multi-dimensional feature vectors; converting the extracted multi-dimensional feature vectors of the voice data into a plurality of texts; importing the obtained texts into a time-series neural network for training to obtain a high-dimensional feature vector; performing a regression operation in a convolutional neural network that takes the high-dimensional feature vector as input; comparing the feature vectors produced by the convolutional neural network with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing it to obtain the final accuracy; and outputting the selected result with the highest similarity as video or audio, completing the final search, and feeding the result back to the user in voice form. The invention solves the technical problem of the prior art that voice information varies unpredictably and it is difficult to label enough data to guarantee the precision of a search algorithm, so the accuracy of voice search is low.

Description

Video searching method based on voice characteristics
Technical Field
The invention relates to the field of artificial intelligence and computer vision processing, and in particular to a video searching method based on voice characteristics.
Background
The rapid development of the internet has made the traditional economy and its industries more intelligent, and people are beginning to generate more demands close to everyday life. Voice interaction, one of the mainstream means of human-computer interaction on intelligent terminals, is used more and more frequently in practice, as is searching for multimedia resources on such terminals. An existing intelligent terminal first converts voice into text and searches by fuzzy-matching that text against the fields in a library file: if the fields exist in the database, the corresponding semantics can be found; if not, they cannot. This method cannot search and recommend through the characteristics (voice, picture) of the multimedia resource itself, which degrades the experience of users with such needs. Most commonly, people want to search for media assets they care about but do not know the name of the source film; in such cases the device needs to analyze the voice information in depth in order to recommend and search media.
For example, when a user searches for movie A, the current search method requires knowing the name of the movie and saying "I want to watch movie A"; the device receives the voice, converts it to text through voice recognition, analyzes the semantics, and then searches for the corresponding movie. Many times, however, the user does not know the title and only remembers certain features. For instance, suppose movie A is the first television show of an actor in 2016: when the user says "I want to watch the first television show of that actor in 2016", the device analyzes the speech field by field and compares the fields against the database, but because the semantics do not match any field, the corresponding video media cannot be found.
Disclosure of Invention
The invention aims to provide a video searching method based on voice characteristics, to solve the technical problem of the prior art that voice information varies unpredictably and it is difficult to label enough data to guarantee the precision of the search algorithm, so the accuracy of voice search is low.
The invention solves the problems through the following technical scheme:
a video searching method based on voice characteristics comprises the following steps:
step 1) extracting voice data to be searched into a plurality of multi-dimensional characteristic vectors;
step 2) converting the extracted voice data of the multi-dimensional feature vectors into a plurality of texts;
step 3) importing the plurality of texts obtained in the step 2) into a time-series neural network for training to obtain a high-dimensional feature vector;
step 4) performing a regression operation in a convolutional neural network that takes the high-dimensional feature vector as input;
step 5) comparing the feature vectors after the convolutional neural network regression operation with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing the selected result with the highest similarity to obtain the final accuracy;
and step 6) outputting the result with the highest similarity selected in the step 5) as video or audio, completing the final search, and feeding the result back to the user in voice form.
Preferably, the training set of the time-series neural network employs a CA8 training set.
Preferably, the samples of the CA8 training set are randomly split into a training set, a validation set, and a test set in an 8:1:1 ratio.
Preferably, the time-series neural network comprises five gated neural units connected in series.
Preferably, the interior of the gated neural unit includes a reset gate and an update gate.
Preferably, the convolutional neural network is arranged in a U-shaped structure, i.e., a decoder-encoder structure.
Preferably, the regression operation flow of the convolutional neural network is as follows: the decoder divides the acquired high-dimensional feature vector into a plurality of linearly independent sub-dimension vectors, and obtains an intermediate vector of the plurality of sub-dimension vectors through convolution by a plurality of convolutional layers of different dimensions; the encoder takes the intermediate vector as input and trains the convolutional neural network to perform convolution combination and restore the original new feature map.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention extracts the voice characteristics of multimedia in multiple dimensions through the time sequence neural network, and then searches the video by using the method of comparing the semantic acquired by the convolutional network combination with the user requirement, thereby improving the accuracy of voice search. The technical problems that the existing voice multimedia search system does not perform deep search according to voice information and users have more actual requirements to perform related search aiming at the characteristics in the voice are solved. The deep learning-based system is urgently needed, can complete multimedia resource recommendation in most of actual life, better optimizes user experience and improves applicability.
Drawings
FIG. 1 is an internal structure of a single gated neural unit of the present invention.
Fig. 2 shows a decoder-encoder structure according to the present invention.
FIG. 3 is an exemplary diagram of audio feature vectors according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
referring to fig. 1 to 3, a video search method based on voice features includes the following steps:
Step 1) cutting the voice information to be searched into a plurality of short segments and extracting each segment of voice information into a feature vector, each of which is a multi-dimensional feature vector, as shown in FIG. 3.
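As a rough illustration of step 1, the following is a minimal NumPy sketch of a segment-and-featurize front end. The frame sizes, band count, and log-spectrum features are illustrative assumptions, not taken from the patent, which does not specify its feature extractor:

```python
import numpy as np

def extract_feature_vectors(waveform, sr=16000, frame_ms=25, hop_ms=10, n_bins=13):
    """Cut a speech signal into short frames and turn each frame into a
    multi-dimensional feature vector (here: log magnitude-spectrum bands).
    A stand-in for the MFCC-style front end the patent implies."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # Pool the spectrum into n_bins coarse bands, then take the log.
        bands = np.array_split(spectrum, n_bins)
        feats = np.log1p(np.array([b.mean() for b in bands]))
        frames.append(feats)
    return np.stack(frames)          # shape: (num_frames, n_bins)

# One second of a 440 Hz tone as dummy "voice" input.
t = np.linspace(0, 1, 16000, endpoint=False)
vectors = extract_feature_vectors(np.sin(2 * np.pi * 440 * t))
```

A production system would more likely compute MFCC or filter-bank features with a dedicated audio library rather than this coarse band pooling.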
Step 2) converting all the extracted multi-dimensional feature vectors into texts.
Step 3) the training set of the time-series neural network is set to the CA8 training set, whose samples are randomly split into a training set, a validation set, and a test set in an 8:1:1 ratio. The parameters of the reset gates and update gates of the time-series neural network are first initialized, and the text converted in step 2) is then imported into the network. Inside each gated neural unit, the reset gate manages and trains the incoming data; once the internal update is completed, the update gate passes the data needed by the next unit on to the next component. Five serially connected gated neural units extract the relevant text features from the text imported in step 2), finally producing a high-dimensional feature vector. The structure of each gated neural unit is shown in FIG. 1.
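The gated neural unit described above behaves like a standard GRU cell. A minimal NumPy sketch of one such cell and a stack of five serially connected units might look as follows; the dimensions and random initialization are illustrative assumptions, since the patent specifies neither:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_cell(x, h, params):
    """One gated neural unit: the reset gate r manages incoming data,
    the update gate z decides what is passed on after the internal update."""
    Wr, Ur, Wz, Uz, Wh, Uh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_cand           # internal update

def make_params(d_in, d_h):
    # Small random weights for the three gate/candidate pairs.
    return [rng.standard_normal(s) * 0.1
            for s in [(d_in, d_h), (d_h, d_h)] * 3]

d_in, d_h, n_units = 13, 64, 5
layers = [make_params(d_in if i == 0 else d_h, d_h) for i in range(n_units)]

def encode(seq):
    """Run a text-feature sequence through 5 stacked gated units and
    return the final hidden state as the high-dimensional feature vector."""
    out = seq
    for params in layers:
        h = np.zeros(d_h)
        states = []
        for x in out:
            h = gru_cell(x, h, params)
            states.append(h)
        out = np.stack(states)
    return out[-1]

vec = encode(rng.standard_normal((20, d_in)))   # high-dimensional feature vector
```

Here `vec` plays the role of the high-dimensional feature vector handed to step 4; a real implementation would also train these weights on the CA8 splits rather than leaving them random.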
Step 4) the convolutional neural network is designed as a U-shaped structure, as shown in FIG. 2, i.e. a decoder-encoder structure. The decoder is designed with a plurality of convolutional layers of different dimensions, namely 64×64, 32×32, 16×16, and 8×8; it splits the high-dimensional feature vector obtained in step 3) into a plurality of linearly independent sub-dimension vectors and finally obtains an intermediate vector of these sub-dimension vectors through the convolutional layers of different dimensions. The encoder takes the intermediate vector produced by the decoder as input, restores the original new feature map through its own convolutional layers of different dimensions according to the modulus and direction of the intermediate vector, and performs feature-map fusion with an up-sampling layer to obtain the final feature map.
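The U-shaped flow of step 4 can be sketched schematically as follows, using simple pooling and repetition as stand-ins for the 64/32/16/8 convolution and up-sampling layers; the actual layer weights and their training are omitted, so this shows only the shape of the decoder-encoder data flow:

```python
import numpy as np

def downsample(v, size):
    """Stand-in for a convolutional layer: pool v down to `size` entries."""
    return np.array([c.mean() for c in np.array_split(v, size)])

def upsample(v, size):
    """Stand-in for an up-sampling layer: repeat entries up to `size`."""
    return np.repeat(v, int(np.ceil(size / len(v))))[:size]

def u_shaped_regression(feature_vec):
    """Decoder path 64 -> 32 -> 16 -> 8 produces the intermediate vector;
    the encoder path mirrors it back and fuses skip features into a final map."""
    sizes = [64, 32, 16, 8]
    skips, v = [], feature_vec
    for s in sizes:                    # decoder: split into sub-dimension vectors
        v = downsample(v, s)
        skips.append(v)
    intermediate = v                   # 8-dim intermediate vector
    for s, skip in zip(reversed(sizes), reversed(skips)):   # encoder
        v = upsample(v, s) + skip      # feature-map fusion via skip links
    return v                           # final feature map (64-dim)

final_map = u_shaped_regression(np.random.default_rng(1).standard_normal(64))
```

The skip connections mirror the feature-map fusion the patent attributes to the up-sampling layer; a trained network would replace the pooling/repetition with learned convolutions.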
Step 5) comparing the final feature map for similarity against the feature maps in the dedicated database, selecting the result with the highest similarity, verifying it, and testing the final accuracy.
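The highest-similarity selection of step 5 can be sketched with cosine similarity over a toy database; the patent does not name its similarity measure, and the asset names and vectors here are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_map, database):
    """Compare the final feature map against every entry in the asset
    database and return the title with the highest cosine similarity."""
    scores = {title: cosine_similarity(query_map, feat)
              for title, feat in database.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
db = {"movie_a": np.ones(64),          # hypothetical stored feature maps
      "movie_b": -np.ones(64),
      "movie_c": rng.standard_normal(64)}
query = np.ones(64) + 0.01 * rng.standard_normal(64)   # close to movie_a
result = best_match(query, db)
```

At scale, this exhaustive comparison would typically be replaced by an approximate nearest-neighbour index.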
Step 6) outputting the result with the highest similarity selected by the convolutional neural network as video or audio, completing the final search, and feeding the result back to the user in voice form.
The invention extracts multi-dimensional voice features of multimedia through a time-series neural network and then searches for the video by comparing the semantics obtained by the convolutional network with the user requirement, thereby improving the accuracy of voice search. This solves the technical problems that existing voice multimedia search systems do not perform a deep search on the voice information, while users increasingly need to search against the characteristics contained in the voice itself. Such a deep-learning-based system is urgently needed: it can complete multimedia resource recommendation in most real-life situations, better optimizes the user experience, and improves applicability.
Although the present invention has been described herein with reference to its illustrated and preferred embodiments, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.

Claims (7)

1. A video searching method based on voice characteristics is characterized in that: the method comprises the following steps:
step 1) extracting voice data to be searched into a plurality of multi-dimensional characteristic vectors;
step 2) converting the extracted voice data of the multi-dimensional feature vectors into a plurality of texts;
step 3) importing the plurality of texts obtained in the step 2) into a time-series neural network for training to obtain a high-dimensional feature vector;
step 4) performing a regression operation in a convolutional neural network that takes the high-dimensional feature vector as input;
step 5) comparing the feature vectors after the convolutional neural network regression operation with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing the selected result with the highest similarity to obtain the final accuracy;
and step 6) outputting the result with the highest similarity selected in the step 5) as video or audio, completing the final search, and feeding the result back to the user in voice form.
2. The video searching method based on the voice feature of claim 1, wherein: the training set of the time-series neural network employs the CA8 training set.
3. The video searching method based on the voice feature of claim 2, wherein: the samples of the CA8 training set are randomly split into a training set, a validation set, and a test set in an 8:1:1 ratio.
4. The video searching method based on the voice feature of claim 1, wherein: the time-series neural network comprises five gated neural units connected in series.
5. The video searching method based on the voice feature of claim 4, wherein: the interior of the gated neural unit includes a reset gate and an update gate.
6. The video searching method based on the voice feature of claim 1, wherein: the convolutional neural network is arranged in a U-shaped structure, namely a decoder-encoder structure.
7. The video searching method based on the voice feature of claim 1, wherein: the regression operation flow of the convolutional neural network comprises the following steps: the decoder divides the acquired high-dimensional feature vector into a plurality of linearly independent sub-dimension vectors, and obtains an intermediate vector of the plurality of sub-dimension vectors through convolution by a plurality of convolutional layers of different dimensions; the encoder takes the intermediate vector as input, and the convolutional neural network performs convolution combination to restore the original new feature map.
CN202010604506.9A 2020-06-29 2020-06-29 Video searching method based on voice characteristics Active CN111753137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604506.9A CN111753137B (en) 2020-06-29 2020-06-29 Video searching method based on voice characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010604506.9A CN111753137B (en) 2020-06-29 2020-06-29 Video searching method based on voice characteristics

Publications (2)

Publication Number Publication Date
CN111753137A true CN111753137A (en) 2020-10-09
CN111753137B CN111753137B (en) 2022-05-03

Family

ID=72676733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604506.9A Active CN111753137B (en) 2020-06-29 2020-06-29 Video searching method based on voice characteristics

Country Status (1)

Country Link
CN (1) CN111753137B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193983A (en) * 2017-05-27 2017-09-22 北京小米移动软件有限公司 Image search method and device
CN107577985A (en) * 2017-07-18 2018-01-12 南京邮电大学 The implementation method of the face head portrait cartooning of confrontation network is generated based on circulation
CN109446332A (en) * 2018-12-25 2019-03-08 银江股份有限公司 A kind of people's mediation case classification system and method based on feature migration and adaptive learning
US10261849B1 (en) * 2017-08-11 2019-04-16 Electronics Arts Inc. Preventative remediation of services
US20190156159A1 (en) * 2017-11-20 2019-05-23 Kavya Venkata Kota Sai KOPPARAPU System and method for automatic assessment of cancer
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110059201A (en) * 2019-04-19 2019-07-26 杭州联汇科技股份有限公司 A kind of across media program feature extracting method based on deep learning
CN110990597A (en) * 2019-12-19 2020-04-10 中国电子科技集团公司信息科学研究院 Cross-modal data retrieval system based on text semantic mapping and retrieval method thereof
CN111078831A (en) * 2019-11-06 2020-04-28 广州荔支网络技术有限公司 Optimization method for converting audio content into text in text reading
CN111126059A (en) * 2019-12-24 2020-05-08 上海风秩科技有限公司 Method and device for generating short text and readable storage medium
CN111191075A (en) * 2019-12-31 2020-05-22 华南师范大学 Cross-modal retrieval method, system and storage medium based on dual coding and association
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video
CN111312228A (en) * 2019-12-09 2020-06-19 中国南方电网有限责任公司 End-to-end-based voice navigation method applied to electric power enterprise customer service
CN111488871A (en) * 2019-01-25 2020-08-04 斯特拉德视觉公司 Method and apparatus for switchable mode R-CNN based monitoring


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU, HONGYU et al.: "Convolutional Neural Network Feature Importance Analysis and Enhanced Feature Selection Model", Journal of Software (软件学报) *

Also Published As

Publication number Publication date
CN111753137B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111191078B (en) Video information processing method and device based on video information processing model
CN108920497B (en) Man-machine interaction method and device
CN108959627B (en) Question-answer interaction method and system based on intelligent robot
US11151191B2 (en) Video content segmentation and search
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
CN111552799A (en) Information processing method, information processing device, electronic equipment and storage medium
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN115495555A (en) Document retrieval method and system based on deep learning
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN113806588B (en) Method and device for searching video
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN113392265A (en) Multimedia processing method, device and equipment
CN115293348A (en) Pre-training method and device for multi-mode feature extraction network
CN116306603A (en) Training method of title generation model, title generation method, device and medium
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN116662565A (en) Heterogeneous information network keyword generation method based on contrast learning pre-training
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN111753137B (en) Video searching method based on voice characteristics
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN112861580A (en) Video information processing method and device based on video information processing model
CN113553844B (en) Domain identification method based on prefix tree features and convolutional neural network
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN114565804A (en) NLP model training and recognizing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant