CN111753137A - Video searching method based on voice characteristics - Google Patents

Video searching method based on voice characteristics

Info

Publication number
CN111753137A
CN111753137A
Authority
CN
China
Prior art keywords
voice
neural network
method based
searching method
video searching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010604506.9A
Other languages
Chinese (zh)
Other versions
CN111753137B (en)
Inventor
Liang Min (梁敏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN202010604506.9A priority Critical patent/CN111753137B/en
Publication of CN111753137A publication Critical patent/CN111753137A/en
Application granted granted Critical
Publication of CN111753137B publication Critical patent/CN111753137B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video searching method based on voice characteristics, which comprises the following steps: extracting the voice data to be searched into a plurality of multi-dimensional feature vectors; converting the extracted multi-dimensional feature vectors of the voice data into a plurality of texts; importing the obtained texts into a time-series neural network for training to obtain a high-dimensional feature vector; performing a regression operation in a convolutional neural network that takes the high-dimensional feature vector as input; comparing the feature vectors produced by the convolutional neural network with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing it to obtain the final accuracy; and outputting the selected result with the highest similarity as video or audio, completing the final search, and feeding the result back to the user in voice form. The invention solves the technical problem of the prior art that voice information varies unpredictably and it is difficult to label enough data to guarantee the precision of a search algorithm, so the accuracy of voice search is low.

Description

Video searching method based on voice characteristics
Technical Field
The invention relates to the field of artificial intelligence and computer vision processing, and in particular to a video searching method based on voice characteristics.
Background
The rapid development of the internet has made the traditional economy and its industries more intelligent, and people are beginning to generate more demands close to everyday life. Voice interaction, one of the mainstream means of human-computer interaction on intelligent terminals, is used more and more frequently in practice, as is searching for multimedia resources on such terminals. An existing intelligent terminal first converts voice into text and searches by fuzzy-matching that text against the fields in a library file: if the fields exist in the database, the corresponding semantics can be found; if not, they cannot. This method cannot search and recommend through the characteristics (voice, picture) of the multimedia resource itself, which degrades the experience of users with such needs. Most commonly, people want to search for media assets they care about but do not know the name of the source film; in such cases the device needs to analyze the voice information in depth in order to recommend and search media.
For example, when a user searches for movie A, the current search method requires knowing the name of the movie and saying "I want to watch movie A"; the device receives the voice, converts it to text through voice recognition, analyzes the semantics, and then searches for the corresponding movie. Many times, however, the user does not know the title and only remembers certain features. For instance, suppose movie A is the first television show of an actor in 2016: when the user says "I want to watch the first television show of that actor in 2016", the device analyzes the speech field by field and compares the fields against the database, but because the semantics do not match any field, the corresponding video media cannot be found.
Disclosure of Invention
The invention aims to provide a video searching method based on voice characteristics, to solve the technical problem of the prior art that voice information varies unpredictably and it is difficult to label enough data to guarantee the precision of the search algorithm, so the accuracy of voice search is low.
The invention solves the problems through the following technical scheme:
a video searching method based on voice characteristics comprises the following steps:
step 1) extracting voice data to be searched into a plurality of multi-dimensional characteristic vectors;
step 2) converting the extracted voice data of the multi-dimensional feature vectors into a plurality of texts;
step 3) importing the plurality of texts obtained in the step 2) into a time-series neural network for training to obtain a high-dimensional feature vector;
step 4) performing a regression operation in a convolutional neural network that takes the high-dimensional feature vector as input;
step 5) comparing the feature vectors after the convolutional neural network regression operation with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing the selected result with the highest similarity to obtain the final accuracy;
and step 6) outputting the result with the highest similarity selected in the step 5) as video or audio, completing the final search, and feeding the result back to the user in voice form.
Preferably, the training set of the time-series neural network employs a CA8 training set.
Preferably, the samples of the CA8 training set are randomly split into a training set, a validation set, and a test set in an 8:1:1 ratio.
Preferably, the time-series neural network comprises five gated neural units connected in series.
Preferably, the interior of the gated neural unit includes a reset gate and an update gate.
Preferably, the convolutional neural network is arranged in a U-shaped structure, i.e., a decoder-encoder structure.
Preferably, the regression operation flow of the convolutional neural network is as follows: the decoder divides the acquired high-dimensional feature vector into a plurality of linearly independent sub-dimension vectors, and obtains an intermediate vector of the plurality of sub-dimension vectors through convolution by a plurality of convolutional layers of different dimensions; the encoder takes the intermediate vector as input and trains the convolutional neural network to perform convolution combination and restore the original new feature map.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the invention extracts the voice characteristics of multimedia in multiple dimensions through the time sequence neural network, and then searches the video by using the method of comparing the semantic acquired by the convolutional network combination with the user requirement, thereby improving the accuracy of voice search. The technical problems that the existing voice multimedia search system does not perform deep search according to voice information and users have more actual requirements to perform related search aiming at the characteristics in the voice are solved. The deep learning-based system is urgently needed, can complete multimedia resource recommendation in most of actual life, better optimizes user experience and improves applicability.
Drawings
FIG. 1 is an internal structure of a single gated neural unit of the present invention.
Fig. 2 shows a decoder-encoder structure according to the present invention.
FIG. 3 is an exemplary diagram of audio feature vectors according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Example 1:
referring to fig. 1 to 3, a video search method based on voice features includes the following steps:
Step 1) cutting the voice information to be searched into a plurality of short segments and extracting each segment of voice information into a feature vector, each of which is a multi-dimensional feature vector, as shown in FIG. 3.
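As a rough illustration of step 1, the following is a minimal NumPy sketch of a segment-and-featurize front end. The frame sizes, band count, and log-spectrum features are illustrative assumptions, not taken from the patent, which does not specify its feature extractor:

```python
import numpy as np

def extract_feature_vectors(waveform, sr=16000, frame_ms=25, hop_ms=10, n_bins=13):
    """Cut a speech signal into short frames and turn each frame into a
    multi-dimensional feature vector (here: log magnitude-spectrum bands).
    A stand-in for the MFCC-style front end the patent implies."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len] * np.hanning(frame_len)
        spectrum = np.abs(np.fft.rfft(frame))
        # Pool the spectrum into n_bins coarse bands, then take the log.
        bands = np.array_split(spectrum, n_bins)
        feats = np.log1p(np.array([b.mean() for b in bands]))
        frames.append(feats)
    return np.stack(frames)          # shape: (num_frames, n_bins)

# One second of a 440 Hz tone as dummy "voice" input.
t = np.linspace(0, 1, 16000, endpoint=False)
vectors = extract_feature_vectors(np.sin(2 * np.pi * 440 * t))
```

A production system would more likely compute MFCC or filter-bank features with a dedicated audio library rather than this coarse band pooling.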
Step 2) converting all the extracted multi-dimensional feature vectors into texts.
Step 3) the training set of the time-series neural network is set to the CA8 training set, whose samples are randomly split into a training set, a validation set, and a test set in an 8:1:1 ratio. The parameters of the reset gates and update gates of the time-series neural network are first initialized, and the text converted in step 2) is then imported into the network. Inside each gated neural unit, the reset gate manages and trains the incoming data; once the internal update is completed, the update gate passes the data needed by the next unit on to the next component. Five serially connected gated neural units extract the relevant text features from the text imported in step 2), finally producing a high-dimensional feature vector. The structure of each gated neural unit is shown in FIG. 1.
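The gated neural unit described above behaves like a standard GRU cell. A minimal NumPy sketch of one such cell and a stack of five serially connected units might look as follows; the dimensions and random initialization are illustrative assumptions, since the patent specifies neither:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_cell(x, h, params):
    """One gated neural unit: the reset gate r manages incoming data,
    the update gate z decides what is passed on after the internal update."""
    Wr, Ur, Wz, Uz, Wh, Uh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_cand           # internal update

def make_params(d_in, d_h):
    # Small random weights for the three gate/candidate pairs.
    return [rng.standard_normal(s) * 0.1
            for s in [(d_in, d_h), (d_h, d_h)] * 3]

d_in, d_h, n_units = 13, 64, 5
layers = [make_params(d_in if i == 0 else d_h, d_h) for i in range(n_units)]

def encode(seq):
    """Run a text-feature sequence through 5 stacked gated units and
    return the final hidden state as the high-dimensional feature vector."""
    out = seq
    for params in layers:
        h = np.zeros(d_h)
        states = []
        for x in out:
            h = gru_cell(x, h, params)
            states.append(h)
        out = np.stack(states)
    return out[-1]

vec = encode(rng.standard_normal((20, d_in)))   # high-dimensional feature vector
```

Here `vec` plays the role of the high-dimensional feature vector handed to step 4; a real implementation would also train these weights on the CA8 splits rather than leaving them random.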
Step 4) the convolutional neural network is designed as a U-shaped structure, as shown in FIG. 2, i.e. a decoder-encoder structure. The decoder is designed with a plurality of convolutional layers of different dimensions, namely 64×64, 32×32, 16×16, and 8×8; it splits the high-dimensional feature vector obtained in step 3) into a plurality of linearly independent sub-dimension vectors and finally obtains an intermediate vector of these sub-dimension vectors through the convolutional layers of different dimensions. The encoder takes the intermediate vector produced by the decoder as input, restores the original new feature map through its own convolutional layers of different dimensions according to the modulus and direction of the intermediate vector, and performs feature-map fusion with an up-sampling layer to obtain the final feature map.
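The U-shaped flow of step 4 can be sketched schematically as follows, using simple pooling and repetition as stand-ins for the 64/32/16/8 convolution and up-sampling layers; the actual layer weights and their training are omitted, so this shows only the shape of the decoder-encoder data flow:

```python
import numpy as np

def downsample(v, size):
    """Stand-in for a convolutional layer: pool v down to `size` entries."""
    return np.array([c.mean() for c in np.array_split(v, size)])

def upsample(v, size):
    """Stand-in for an up-sampling layer: repeat entries up to `size`."""
    return np.repeat(v, int(np.ceil(size / len(v))))[:size]

def u_shaped_regression(feature_vec):
    """Decoder path 64 -> 32 -> 16 -> 8 produces the intermediate vector;
    the encoder path mirrors it back and fuses skip features into a final map."""
    sizes = [64, 32, 16, 8]
    skips, v = [], feature_vec
    for s in sizes:                    # decoder: split into sub-dimension vectors
        v = downsample(v, s)
        skips.append(v)
    intermediate = v                   # 8-dim intermediate vector
    for s, skip in zip(reversed(sizes), reversed(skips)):   # encoder
        v = upsample(v, s) + skip      # feature-map fusion via skip links
    return v                           # final feature map (64-dim)

final_map = u_shaped_regression(np.random.default_rng(1).standard_normal(64))
```

The skip connections mirror the feature-map fusion the patent attributes to the up-sampling layer; a trained network would replace the pooling/repetition with learned convolutions.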
Step 5) comparing the final feature map for similarity against the feature maps in the dedicated database, selecting the result with the highest similarity, verifying it, and testing the final accuracy.
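The highest-similarity selection of step 5 can be sketched with cosine similarity over a toy database; the patent does not name its similarity measure, and the asset names and vectors here are invented for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_map, database):
    """Compare the final feature map against every entry in the asset
    database and return the title with the highest cosine similarity."""
    scores = {title: cosine_similarity(query_map, feat)
              for title, feat in database.items()}
    return max(scores, key=scores.get)

rng = np.random.default_rng(2)
db = {"movie_a": np.ones(64),          # hypothetical stored feature maps
      "movie_b": -np.ones(64),
      "movie_c": rng.standard_normal(64)}
query = np.ones(64) + 0.01 * rng.standard_normal(64)   # close to movie_a
result = best_match(query, db)
```

At scale, this exhaustive comparison would typically be replaced by an approximate nearest-neighbour index.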
Step 6) outputting the result with the highest similarity selected by the convolutional neural network as video or audio, completing the final search, and feeding the result back to the user in voice form.
The invention extracts multi-dimensional voice features of multimedia through a time-series neural network and then searches for the video by comparing the semantics obtained by the convolutional network with the user requirement, thereby improving the accuracy of voice search. This solves the technical problems that existing voice multimedia search systems do not perform a deep search on the voice information, while users increasingly need to search against the characteristics contained in the voice itself. Such a deep-learning-based system is urgently needed: it can complete multimedia resource recommendation in most real-life situations, better optimizes the user experience, and improves applicability.
Although the present invention has been described herein with reference to its illustrated and preferred embodiments, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that fall within the spirit and scope of the principles of this disclosure.

Claims (7)

1. A video searching method based on voice characteristics is characterized in that: the method comprises the following steps:
step 1) extracting voice data to be searched into a plurality of multi-dimensional characteristic vectors;
step 2) converting the extracted voice data of the multi-dimensional feature vectors into a plurality of texts;
step 3) importing the plurality of texts obtained in the step 2) into a time-series neural network for training to obtain a high-dimensional feature vector;
step 4) performing a regression operation in a convolutional neural network that takes the high-dimensional feature vector as input;
step 5) comparing the feature vectors after the convolutional neural network regression operation with the corresponding feature vectors in the database, selecting the result with the highest similarity, and testing the selected result with the highest similarity to obtain the final accuracy;
and step 6) outputting the result with the highest similarity selected in the step 5) as video or audio, completing the final search, and feeding the result back to the user in voice form.
2. The video searching method based on the voice feature of claim 1, wherein: the training set of the time-series neural network employs the CA8 training set.
3. The video searching method based on the voice feature of claim 2, wherein: the samples of the CA8 training set are randomly split into a training set, a validation set, and a test set in an 8:1:1 ratio.
4. The video searching method based on the voice feature of claim 1, wherein: the time-series neural network comprises five gated neural units connected in series.
5. The video searching method based on the voice feature of claim 4, wherein: the interior of the gated neural unit includes a reset gate and an update gate.
6. The video searching method based on the voice feature of claim 1, wherein: the convolutional neural network is arranged in a U-shaped structure, namely a decoder-encoder structure.
7. The video searching method based on the voice feature of claim 1, wherein: the regression operation flow of the convolutional neural network comprises the following steps: the decoder divides the acquired high-dimensional feature vector into a plurality of linearly independent sub-dimension vectors, and obtains an intermediate vector of the plurality of sub-dimension vectors through convolution by a plurality of convolutional layers of different dimensions; the encoder takes the intermediate vector as input, and the convolutional neural network performs convolution combination to restore the original new feature map.
CN202010604506.9A 2020-06-29 2020-06-29 Video searching method based on voice characteristics Active CN111753137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010604506.9A CN111753137B (en) 2020-06-29 2020-06-29 Video searching method based on voice characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010604506.9A CN111753137B (en) 2020-06-29 2020-06-29 Video searching method based on voice characteristics

Publications (2)

Publication Number Publication Date
CN111753137A true CN111753137A (en) 2020-10-09
CN111753137B CN111753137B (en) 2022-05-03

Family

ID=72676733

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010604506.9A Active CN111753137B (en) 2020-06-29 2020-06-29 Video searching method based on voice characteristics

Country Status (1)

Country Link
CN (1) CN111753137B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193983A (en) * 2017-05-27 2017-09-22 北京小米移动软件有限公司 Image search method and device
CN107577985A (en) * 2017-07-18 2018-01-12 南京邮电大学 The implementation method of the face head portrait cartooning of confrontation network is generated based on circulation
CN109446332A (en) * 2018-12-25 2019-03-08 银江股份有限公司 A kind of people's mediation case classification system and method based on feature migration and adaptive learning
US10261849B1 (en) * 2017-08-11 2019-04-16 Electronics Arts Inc. Preventative remediation of services
US20190156159A1 (en) * 2017-11-20 2019-05-23 Kavya Venkata Kota Sai KOPPARAPU System and method for automatic assessment of cancer
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS
CN110059201A (en) * 2019-04-19 2019-07-26 杭州联汇科技股份有限公司 A kind of across media program feature extracting method based on deep learning
CN110990597A (en) * 2019-12-19 2020-04-10 中国电子科技集团公司信息科学研究院 Cross-modal data retrieval system based on text semantic mapping and retrieval method thereof
CN111078831A (en) * 2019-11-06 2020-04-28 广州荔支网络技术有限公司 Optimization method for converting audio content into text in text reading
CN111126059A (en) * 2019-12-24 2020-05-08 上海风秩科技有限公司 Method and device for generating short text and readable storage medium
CN111191075A (en) * 2019-12-31 2020-05-22 华南师范大学 Cross-modal retrieval method, system and storage medium based on dual coding and association
CN111241996A (en) * 2020-01-09 2020-06-05 桂林电子科技大学 Method for identifying human motion in video
CN111312228A (en) * 2019-12-09 2020-06-19 中国南方电网有限责任公司 End-to-end-based voice navigation method applied to electric power enterprise customer service
CN111488871A (en) * 2019-01-25 2020-08-04 斯特拉德视觉公司 Method and apparatus for switchable mode R-CNN based monitoring


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LU, HONGYU et al.: "Convolutional Neural Network Feature Importance Analysis and Enhanced Feature Selection Model", Journal of Software (软件学报) *

Also Published As

Publication number Publication date
CN111753137B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111191078B (en) Video information processing method and device based on video information processing model
CN108920497B (en) Man-machine interaction method and device
CN108959627B (en) Question-answer interaction method and system based on intelligent robot
US11151191B2 (en) Video content segmentation and search
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
CN111552799A (en) Information processing method, information processing device, electronic equipment and storage medium
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN115495555A (en) Document retrieval method and system based on deep learning
CN113672708A (en) Language model training method, question and answer pair generation method, device and equipment
CN113806588B (en) Method and device for searching video
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN113392265A (en) Multimedia processing method, device and equipment
CN115293348A (en) Pre-training method and device for multi-mode feature extraction network
CN116306603A (en) Training method of title generation model, title generation method, device and medium
CN112085120A (en) Multimedia data processing method and device, electronic equipment and storage medium
CN116662565A (en) Heterogeneous information network keyword generation method based on contrast learning pre-training
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN111753137B (en) Video searching method based on voice characteristics
CN110941958A (en) Text category labeling method and device, electronic equipment and storage medium
CN112861580A (en) Video information processing method and device based on video information processing model
CN113553844B (en) Domain identification method based on prefix tree features and convolutional neural network
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
CN114565804A (en) NLP model training and recognizing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant