
- 2021.12.11
# Experiment Notes (Keras==2.2.5, tensorflow==1.15.0)
- Build an audio dataset and implement a trigger word (wake word) detection algorithm
- The trigger word in this experiment is "activate"; each time the algorithm hears an "activate", it triggers a chime
- We label a spoken "activate" as "positive" and everything else as "negative"
Build a dataset of "activate" and other words spoken in a variety of environments. Available data:
- Background noise recorded in different environments
- Audio clips containing "positive / negative" words (including a range of accents)
- In short, there are three kinds of audio clips:
  - "background noise"
  - "positive words"
  - "negative words"
- We will synthesize the audio dataset from these three kinds of clips
- Audio is recorded by a microphone as changes in air pressure
- A clip of audio can therefore be thought of as a long sequence of numbers measuring those pressure changes
- The audio we use is sampled at 44100 numbers per second
It is hard to tell from this "raw" representation of audio whether the word "activate" has been said. To help the sequence model learn to detect the trigger word more easily, we compute a spectrogram of the audio. The spectrogram tells us how much of each frequency is present in an audio clip at each moment in time.
```python
x = graph_spectrogram("audio_examples/example_train.wav")
```
In the plot, blue means a frequency is only weakly present at that moment, while green means it is strongly present.
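graph_spectrogram is a helper provided with the assignment's utilities, not defined in this post. Below is a minimal sketch of how it might be implemented with matplotlib's specgram; the window length (nfft = 200), sampling rate (fs = 8000), and overlap (noverlap = 120) are assumptions, chosen so that a 10 s clip yields the (101, 5511) shape seen below.

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

def graph_spectrogram(wav_file):
    """Hypothetical sketch of the assignment helper: plot and return a spectrogram."""
    rate, data = wavfile.read(wav_file)
    nfft = 200      # length of each window segment (assumption)
    fs = 8000       # sampling frequency passed to specgram (assumption)
    noverlap = 120  # overlap between windows (assumption)
    # Use the first channel if the recording is stereo
    if data.ndim == 2:
        data = data[:, 0]
    # pxx has shape (n_freq, n_time_steps): (441000 - 120) // (200 - 120) = 5511 windows,
    # each with 200 // 2 + 1 = 101 frequency bins
    pxx, freqs, bins, im = plt.specgram(data, NFFT=nfft, Fs=fs, noverlap=noverlap)
    return pxx
```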
The spectrogram will be the network's input $x$, with $T_x = 5511$ (each spectrogram has 5511 time steps).
```python
from scipy.io import wavfile

_, data = wavfile.read("audio_examples/example_train.wav")
print("Time steps in audio recording before spectrogram", data[:,0].shape)
print("Time steps in input after spectrogram", x.shape)
```
```
>> Time steps in audio recording before spectrogram (441000,)
>> Time steps in input after spectrogram (101, 5511)
```
This lets us define the number of time steps in the spectrogram and the number of frequencies at each time step:
```python
Tx = 5511    # The number of time steps input to the model from the spectrogram
n_freq = 101 # Number of frequencies input to the model at each time step of the spectrogram
```
We define the output of the GRU model to have $T_y = 1375$ time steps: the GRU divides the ten-second clip into 1375 intervals and tries to predict, for each interval, whether "activate" has just been said.
```python
Ty = 1375 # The number of time steps in the output of our model
```

## Generating a single training example
Because speech data is hard to acquire and label, we synthesize training examples from the three kinds of clips described above. Creating one example takes three steps:
- Pick a random ten-second background audio clip
- Randomly insert 0 to 4 "activate" clips into it
- Randomly insert 0 to 2 "negative word" clips into it

Because we insert the clips ourselves, we know exactly where each "activate" clip sits, which makes labeling straightforward.

We use the pydub package to process audio. pydub treats 1 ms as one discrete time step (10 s = 10,000 ms), which is why we represent a ten-second clip as 10,000 steps.
```python
# Load audio segments using pydub
activates, negatives, backgrounds = load_raw_audio()

print("background len: " + str(len(backgrounds[0]))) # Should be 10,000, since it is a 10 sec clip
print("activate[0] len: " + str(len(activates[0])))  # Maybe around 1000, since an "activate" audio clip is usually around 1 sec (but varies a lot)
print("activate[1] len: " + str(len(activates[1])))  # Different "activate" clips can have different lengths
```
```
>> background len: 10000
>> activate[0] len: 721
>> activate[1] len: 731
```
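load_raw_audio is likewise an assignment helper. A minimal sketch of what it might look like, assuming the raw clips live in raw_data/activates, raw_data/negatives, and raw_data/backgrounds (the folder layout is an assumption):

```python
import os
from pydub import AudioSegment

def load_raw_audio(raw_dir="./raw_data"):
    """Hypothetical loader: reads every .wav file in the three sub-folders
    into pydub AudioSegment objects (len() of a segment is its length in ms)."""
    def load_folder(name):
        folder = os.path.join(raw_dir, name)
        return [AudioSegment.from_wav(os.path.join(folder, f))
                for f in sorted(os.listdir(folder)) if f.endswith(".wav")]
    return load_folder("activates"), load_folder("negatives"), load_folder("backgrounds")
```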
Note: when adding clips on top of the background noise, a new clip must not overlap any clip that has already been placed.

All of the background's labels start as 0. When we insert an "activate" clip, we set the labels of the 50 output steps that follow its end to 1, as in the worked example below.
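For example, if an inserted "activate" clip ends at 5000 ms, the corresponding output index is int(5000 * 1375 / 10000) = 687, so y[0, 688] through y[0, 737] are set to 1, while the step at index 687 itself stays 0.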
We will rely on four helper functions.

The first randomly generates the start and end position for a clip:
- Input: the length of the clip to place
- Output: a randomly chosen start and end position for it
```python
def get_random_time_segment(segment_ms):
    """
    Gets a random time segment of duration segment_ms in a 10,000 ms audio clip.

    Arguments:
    segment_ms -- the duration of the segment in ms

    Returns:
    segment_time -- a tuple of (segment_start, segment_end) in ms
    """
    segment_start = np.random.randint(low=0, high=10000-segment_ms)  # Make sure segment doesn't run past the 10 sec background
    segment_end = segment_start + segment_ms - 1

    return (segment_start, segment_end)
```
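A quick check of the helper (the result is random; the pair shown is just one possibility):

```python
segment_time = get_random_time_segment(721)
print(segment_time)  # e.g. (4231, 4951); note end - start + 1 == 721
```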
The second function checks whether a clip we are about to insert overlaps any of the clips already placed:
- Input:
  - the start and end position of the clip to insert
  - the positions of the existing clips
```python
def is_overlapping(segment_time, previous_segments):
    """
    Checks if the time of a segment overlaps with the times of existing segments.

    Arguments:
    segment_time -- a tuple of (segment_start, segment_end) for the new segment
    previous_segments -- a list of tuples of (segment_start, segment_end) for the existing segments

    Returns:
    True if the time segment overlaps with any of the existing segments, False otherwise
    """
    segment_start, segment_end = segment_time

    ### START CODE HERE ### (≈ 4 lines)
    # Step 1: Initialize overlap as a "False" flag. (≈ 1 line)
    overlap = False

    # Step 2: loop over the previous_segments' start and end times.
    # Compare start/end times and set the flag to True if there is an overlap (≈ 3 lines)
    for previous_start, previous_end in previous_segments:
        if previous_start <= segment_end and previous_end >= segment_start:
            overlap = True
    ### END CODE HERE ###

    return overlap
```
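Two quick sanity checks, worked by hand from the overlap condition above:

```python
print(is_overlapping((950, 1430), [(2000, 2550), (260, 949)]))    # False: 949 < 950 and 2000 > 1430
print(is_overlapping((2305, 2950), [(824, 1532), (1900, 2305)]))  # True: (1900, 2305) touches 2305
```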
The third function inserts a clip of another type into the background audio. It works as follows:

- Input:
  - the background audio
  - the audio clip to insert
  - the clips already placed
- Output:
  - the new background audio with the clip overlaid on it
  - the start and end position of the newly inserted clip

The process can be summarized in four steps:
- Use the first helper function to draw a random insertion position
- Check whether that position overlaps any existing clip, and keep drawing new positions until there is no overlap
- Add the new clip's position to the list of existing clips
- Overlay the clip onto the background audio
```python
# GRADED FUNCTION: insert_audio_clip

def insert_audio_clip(background, audio_clip, previous_segments):
    """
    Insert a new audio segment over the background noise at a random time step, ensuring that the
    audio segment does not overlap with existing segments.

    Arguments:
    background -- a 10 second background audio recording.
    audio_clip -- the audio clip to be inserted/overlaid.
    previous_segments -- times where audio segments have already been placed

    Returns:
    new_background -- the updated background audio
    segment_time -- the (segment_start, segment_end) position of the inserted clip
    """
    # Get the duration of the audio clip in ms
    segment_ms = len(audio_clip)

    ### START CODE HERE ###
    # Step 1: Use one of the helper functions to pick a random time segment onto which to insert
    # the new audio clip. (≈ 1 line)
    segment_time = get_random_time_segment(segment_ms)

    # Step 2: Check if the new segment_time overlaps with one of the previous_segments. If so, keep
    # picking new segment_time at random until it doesn't overlap. (≈ 2 lines)
    while is_overlapping(segment_time, previous_segments):
        segment_time = get_random_time_segment(segment_ms)

    # Step 3: Add the new segment_time to the list of previous_segments (≈ 1 line)
    previous_segments.append(segment_time)
    ### END CODE HERE ###

    # Step 4: Superpose audio segment and background
    new_background = background.overlay(audio_clip, position=segment_time[0])

    return new_background, segment_time
```
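A usage sketch (the placement is random; exporting to a file so you can listen is optional):

```python
np.random.seed(5)
audio_clip, segment_time = insert_audio_clip(backgrounds[0], activates[0], [(3790, 4400)])
print(segment_time)  # some (start, end) pair guaranteed not to overlap (3790, 4400)
audio_clip.export("insert_test.wav", format="wav")
```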
The fourth function sets the labels for an inserted "activate" clip to 1:
- Input:
  - the current label vector y
  - the end position of the inserted clip
```python
# GRADED FUNCTION: insert_ones

def insert_ones(y, segment_end_ms):
    """
    Update the label vector y. The labels of the 50 output steps strictly after the end of the segment
    should be set to 1. By strictly we mean that the label at segment_end_y itself should be 0 while the
    50 following labels should be ones.

    Arguments:
    y -- numpy array of shape (1, Ty), the labels of the training example
    segment_end_ms -- the end time of the segment in ms

    Returns:
    y -- updated labels
    """
    # Convert the segment end from ms to an index among the Ty spectrogram output steps
    segment_end_y = int(segment_end_ms * Ty / 10000.0)

    # Set the 50 labels following the end of the segment to 1
    ### START CODE HERE ### (≈ 3 lines)
    for i in range(segment_end_y + 1, segment_end_y + 51):
        if i < Ty:
            y[0, i] = 1.0
    ### END CODE HERE ###

    return y
```
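A sanity check, with the indices worked out by hand: int(9700 * 1375 / 10000) = 1333, so the ones start at index 1334 and are clipped at Ty.

```python
arr1 = insert_ones(np.zeros((1, Ty)), 9700)
print(arr1[0, 1333], arr1[0, 1334])  # 0.0 1.0
print(int(arr1.sum()))               # 41 -- only 41 of the 50 ones fit before Ty
```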
Using the functions above, we can now build a training example by inserting both "activate" and "negative" clips into a background:

- Initialize the label vector $y$ as a zero vector of shape $(1, T_y)$
- Initialize the set of existing segments to an empty list
- Randomly insert 0 to 4 "activate" clips, setting the corresponding positions of $y$ to 1
- Randomly insert 0 to 2 "negative" clips
```python
# GRADED FUNCTION: create_training_example

def create_training_example(background, activates, negatives):
    """
    Creates a training example with a given background, activates, and negatives.

    Arguments:
    background -- a 10 second background audio recording
    activates -- a list of audio segments of the word "activate"
    negatives -- a list of audio segments of random words that are not "activate"

    Returns:
    x -- the spectrogram of the training example
    y -- the label at each time step of the spectrogram
    """
    # Set the random seed
    np.random.seed(18)

    # Make background quieter
    background = background - 20

    ### START CODE HERE ###
    # Step 1: Initialize y (label vector) of zeros (≈ 1 line)
    y = np.zeros((1, Ty))

    # Step 2: Initialize segment times as empty list (≈ 1 line)
    previous_segments = []
    ### END CODE HERE ###

    # Select 0-4 random "activate" audio clips from the entire list of "activates" recordings
    number_of_activates = np.random.randint(0, 5)
    random_indices = np.random.randint(len(activates), size=number_of_activates)
    random_activates = [activates[i] for i in random_indices]

    ### START CODE HERE ### (≈ 3 lines)
    # Step 3: Loop over randomly selected "activate" clips and insert in background
    for random_activate in random_activates:
        # Insert the audio clip on the background
        background, segment_time = insert_audio_clip(background, random_activate, previous_segments)
        # Retrieve segment_start and segment_end from segment_time
        segment_start, segment_end = segment_time
        # Insert labels in "y"
        y = insert_ones(y, segment_end)
    ### END CODE HERE ###

    # Select 0-2 random negatives audio recordings from the entire list of "negatives" recordings
    number_of_negatives = np.random.randint(0, 3)
    random_indices = np.random.randint(len(negatives), size=number_of_negatives)
    random_negatives = [negatives[i] for i in random_indices]

    ### START CODE HERE ### (≈ 2 lines)
    # Step 4: Loop over randomly selected negative clips and insert in background
    for random_negative in random_negatives:
        # Insert the audio clip on the background
        background, _ = insert_audio_clip(background, random_negative, previous_segments)
    ### END CODE HERE ###

    # Standardize the volume of the audio clip
    background = match_target_amplitude(background, -20.0)

    # Export new training example
    file_handle = background.export("train.wav", format="wav")
    print("File (train.wav) was saved in your directory.")

    # Get and plot spectrogram of the new recording (background with superposition of positive and negatives)
    x = graph_spectrogram("train.wav")

    return x, y
```
Calling it:
```python
x, y = create_training_example(backgrounds[0], activates, negatives)
```
Load the pre-generated training set:
```python
# Load preprocessed training examples
X = np.load("./XY_train/X.npy")
Y = np.load("./XY_train/Y.npy")
```
## Development set

We load 25 real ten-second recordings whose distribution is similar to that of the test set:
```python
# Load preprocessed dev set examples
X_dev = np.load("./XY_dev/X_dev.npy")
Y_dev = np.load("./XY_dev/Y_dev.npy")
```
## Model

Import the required libraries:
```python
from keras.callbacks import ModelCheckpoint
from keras.models import Model, load_model, Sequential
from keras.layers import Dense, Activation, Dropout, Input, Masking, TimeDistributed, LSTM, Conv1D
from keras.layers import GRU, Bidirectional, BatchNormalization, Reshape
from keras.optimizers import Adam
```

### Build the model
The Conv1D layer takes the 5511 spectrogram time steps as input and outputs 1375 time steps. These feed the rest of the network, which produces the $T_y = 1375$ outputs, i.e. one prediction per time step of whether it contains "activate".
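The 5511-to-1375 reduction follows from the standard 1-D convolution output-length formula: with filter size 15 and stride 4 (and no padding), $\lfloor (5511 - 15)/4 \rfloor + 1 = 1374 + 1 = 1375$.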
Building the model takes 4 steps:
- Implement the convolution with Conv1D(), using 196 filters, filter size = 15, and stride = 4
- Create the first GRU layer via X = GRU(units = 128, return_sequences = True)(X)
  - return_sequences = True means the GRU passes its hidden state at every time step to the next layer
- Create the second GRU layer, like the previous one but with an extra Dropout layer
- Create a time-distributed dense layer via X = TimeDistributed(Dense(1, activation = "sigmoid"))(X)
```python
# GRADED FUNCTION: model

def model(input_shape):
    """
    Function creating the model's graph in Keras.

    Argument:
    input_shape -- shape of the model's input data (using Keras conventions)

    Returns:
    model -- Keras model instance
    """
    X_input = Input(shape=input_shape)

    ### START CODE HERE ###
    # Step 1: CONV layer (≈4 lines)
    X = Conv1D(196, 15, strides=4)(X_input)  # CONV1D
    X = BatchNormalization()(X)              # Batch normalization
    X = Activation('relu')(X)                # ReLu activation
    X = Dropout(rate=0.8)(X)                 # dropout (use 0.8)

    # Step 2: First GRU Layer (≈4 lines)
    X = GRU(units=128, return_sequences=True)(X)  # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                      # dropout (use 0.8)
    X = BatchNormalization()(X)                   # Batch normalization

    # Step 3: Second GRU Layer (≈4 lines)
    X = GRU(units=128, return_sequences=True)(X)  # GRU (use 128 units and return the sequences)
    X = Dropout(rate=0.8)(X)                      # dropout (use 0.8)
    X = BatchNormalization()(X)                   # Batch normalization
    X = Dropout(rate=0.8)(X)                      # dropout (use 0.8)

    # Step 4: Time-distributed dense layer (≈1 line)
    X = TimeDistributed(Dense(1, activation="sigmoid"))(X)  # time distributed (sigmoid)
    ### END CODE HERE ###

    model = Model(inputs=X_input, outputs=X)

    return model
```
Create the model and print a summary:
```python
model = model(input_shape = (Tx, n_freq)) # create the model
model.summary()                           # log the architecture
```

### Fit the model
Load a model that has already been trained:
```python
model = load_model('./models/tr_model.h5')
```
Train the model:
```python
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
model.fit(X, Y, batch_size = 5, epochs=1)
```

### Test the model
```python
loss, acc = model.evaluate(X_dev, Y_dev)
print("Dev set accuracy = ", acc)
```
## Requirements (conda)

```
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: win-64
_tflow_select=2.2.0=eigen
absl-py=0.13.0=py36haa95532_0
astor=0.8.1=py36haa95532_0
blas=1.0=mkl
ca-certificates=2021.10.26=haa95532_2
certifi=2016.2.28=py36_0
colorama=0.3.9=py36_0
coverage=4.4.1=py36_0
cycler=0.10.0=py36_0
cython=0.26=py36_0
decorator=4.1.2=py36_0
freetype=2.10.4=hd328e21_0
gast=0.2.2=py36_0
google-pasta=0.2.0=pyhd3eb1b0_0
grpcio=1.36.1=py36hc60d5dd_1
h5py=2.10.0=py36h5e291fa_0
hdf5=1.10.4=h7ebc959_0
icc_rt=2019.0.0=h0cc432a_1
icu=58.2=ha925a31_3
importlib-metadata=4.8.1=py36haa95532_0
intel-openmp=2021.4.0=haa95532_3556
ipykernel=4.6.1=py36_0
ipython=6.1.0=py36_0
ipython_genutils=0.2.0=py36_0
jedi=0.10.2=py36_2
jpeg=9d=h2bbff1b_0
jupyter_client=5.1.0=py36_0
jupyter_core=4.3.0=py36_0
keras=2.2.5=py36_1
keras-applications=1.0.8=py_1
keras-preprocessing=1.1.2=pyhd3eb1b0_0
kiwisolver=1.3.1=py36hd77b12b_0
libgpuarray=0.7.6=hfa6e2cd_0
libpng=1.6.37=h2a8f88b_0
libprotobuf=3.17.2=h23ce68f_1
libpython=2.0=py36_0
m2w64-binutils=2.25.1=5
m2w64-bzip2=1.0.6=6
m2w64-crt-git=5.0.0.4636.2595836=2
m2w64-gcc=5.3.0=6
m2w64-gcc-ada=5.3.0=6
m2w64-gcc-fortran=5.3.0=6
m2w64-gcc-libgfortran=5.3.0=6
m2w64-gcc-libs=5.3.0=7
m2w64-gcc-libs-core=5.3.0=7
m2w64-gcc-objc=5.3.0=6
m2w64-gmp=6.1.0=2
m2w64-headers-git=5.0.0.4636.c0ad18a=2
m2w64-isl=0.16.1=2
m2w64-libiconv=1.14=6
m2w64-libmangle-git=5.0.0.4509.2e5a9a2=2
m2w64-libwinpthread-git=5.0.0.4634.697f757=2
m2w64-make=4.1.2351.a80a8b8=2
m2w64-mpc=1.0.3=3
m2w64-mpfr=3.1.4=4
m2w64-pkg-config=0.29.1=2
m2w64-toolchain=5.3.0=7
m2w64-tools-git=5.0.0.4592.90b8472=2
m2w64-windows-default-manifest=6.4=3
m2w64-winpthreads-git=5.0.0.4634.697f757=2
m2w64-zlib=1.2.8=10
mako=1.0.6=py36_0
markdown=3.3.4=py36haa95532_0
markupsafe=1.0=py36_0
matplotlib=3.2.2=1
matplotlib-base=3.2.2=py36hfa737b6_1
mkl=2020.2=256
mkl-service=2.3.0=py36h196d8e1_0
mkl_fft=1.3.0=py36h46781fe_0
mkl_random=1.1.1=py36h47e9c7a_0
msys2-conda-epoch=20160418=1
numpy=1.19.2=py36hadc3359_0
numpy-base=1.19.2=py36ha3acd2a_0
openssl=1.1.1l=h2bbff1b_0
opt_einsum=3.3.0=pyhd3eb1b0_1
path.py=10.3.1=py36_0
pickleshare=0.7.4=py36_0
pip=9.0.1=py36_1
prompt_toolkit=1.0.15=py36_0
protobuf=3.17.2=py36hd77b12b_0
pydub=0.25.1=pyhd8ed1ab_0
pygments=2.2.0=py36_0
pygpu=0.7.6=py36h2a96729_0
pyparsing=2.2.0=py36_0
pyqt=5.9.2=py36h6538335_2
pyreadline=2.1=py36_0
python=3.6.13=h3758d61_0
python-dateutil=2.6.1=py36_0
python_abi=3.6=2_cp36m
pyyaml=3.12=py36_0
pyzmq=16.0.2=py36_0
qt=5.9.7=vc14h73c81de_0
scipy=1.5.2=py36h9439919_0
setuptools=36.4.0=py36_1
simplegeneric=0.8.1=py36_1
sip=4.19.8=py36h6538335_0
six=1.16.0=pyhd3eb1b0_0
sqlite=3.36.0=h2bbff1b_0
tensorboard=1.15.0=pyhb230dea_0
tensorflow=1.15.0=eigen_py36h932cce6_0
tensorflow-base=1.15.0=eigen_py36h07d2309_0
tensorflow-estimator=2.6.0=pyh7b7c402_0
termcolor=1.1.0=py36_0
theano=0.9.0=py36_0
tornado=4.5.2=py36_0
traitlets=4.3.2=py36_0
typing_extensions=3.10.0.2=pyh06a4308_0
vc=14.2=h21ff451_1
vs2015_runtime=14.27.29016=h5e58377_2
wcwidth=0.1.7=py36_0
webencodings=0.5.1=py36_1
werkzeug=0.16.1=py_0
wheel=0.29.0=py36_0
wincertstore=0.2=py36_0
wrapt=1.12.1=py36he774522_1
zipp=3.6.0=pyhd3eb1b0_0
zlib=1.2.11=h62dcd97_4
```