Training Kaldi models with custom features
Kaldi Speech Recognition Toolkit is a freely available toolkit that offers several tools for conducting research on automatic speech recognition (ASR). It lets us train an ASR system from scratch, all the way from feature extraction (MFCC, FBANK, ivector, FMLLR, ...), through GMM and DNN acoustic model training, to decoding with advanced language models, and produces state-of-the-art results.
While kaldi offers plenty of flexibility at every stage, sometimes we also need to experiment with features that are not offered by the kaldi repository. Kaldi uses the ark format to store features, so if we want to run experiments with customized features, they must be converted to the ark format first. The goal of this post is to explain how we can extract and store custom features in the ark format using MATLAB and python.
First Steps
In this tutorial, we will start from a data folder that is generated by kaldi, in order to avoid directory validation errors (kaldi validates the data directory before training any model). So we will run the kaldi routine to prepare the data folders (typically done by the local/<datasetname>_data_prep.sh script). This generates all the required files, such as text containing the transcriptions, utt2spk and spk2utt for CMVN and ivector computation, and wav.scp for reading the audio files from disk. Then we will run the feature extraction pipeline in the default kaldi way, say FBANK features. Typically, the data-fbank folder contains the train, test and dev folders along with the text, utt2spk, spk2utt and wav.scp files.
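For reference, with hypothetical utterance and speaker ids, a wav.scp line maps an utterance id to its audio file, for example utt_0001 /path/to/audio/utt_0001.wav, and the corresponding utt2spk line maps the utterance to its speaker, for example utt_0001 spk_01.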
Running steps/make_fbank.sh will typically store the features in ark format in the fbank folder, and the corresponding feats.scp files (used for reading the ark files) will be placed in the respective subfolders of data-fbank. In order to run the feature extraction in parallel, kaldi typically splits the data into 10 sets and extracts them in parallel. Thus the directory structure will typically be:
fbank
├── raw_fbank_dev.1.ark
├── raw_fbank_dev.1.scp
├── ...
├── ...
├── raw_fbank_dev.10.ark
├── raw_fbank_dev.10.scp
├── raw_fbank_test.1.ark
├── raw_fbank_test.1.scp
├── ...
├── ...
├── raw_fbank_test.10.ark
├── raw_fbank_test.10.scp
├── raw_fbank_train.1.ark
├── raw_fbank_train.1.scp
├── ...
├── ...
├── raw_fbank_train.10.ark
└── raw_fbank_train.10.scp
data-fbank
├── dev
│   ├── feats.scp
│   ├── spk2gender
│   ├── spk2utt
│   ├── stm
│   ├── text
│   ├── utt2spk
│   ├── wav.scp
│   └── split10
├── test
│   ├── feats.scp
│   ├── spk2gender
│   ├── spk2utt
│   ├── text
│   ├── utt2spk
│   ├── wav.scp
│   └── split10
└── train
    ├── feats.scp
    ├── spk2gender
    ├── spk2utt
    ├── split10
    ├── text
    ├── utt2spk
    └── wav.scp
To summarize the steps so far:
# run the data preparation
bash local/<database>_data_prep.sh

# extract fbank features
mkdir fbank data-fbank
for x in train test dev; do
  cp -r data/$x data-fbank/
  steps/make_fbank.sh --nj 10 --fbank-config conf/fbank.conf \
    data-fbank/$x exp/make_fbank/$x fbank || exit 1;
done
Notice that feats.scp is a concatenation of all the .scp files belonging to the subset (train, dev or test). The .scp files have two columns: the first column contains the unique id of the utterance and the second column says where the corresponding features are located in the ark file. This tutorial manipulates the ark and scp files so that they can be used by kaldi for training.
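For illustration, with hypothetical utterance ids and paths, a few lines of feats.scp look like the following, where the number after the colon is the byte offset of that utterance's features inside the ark file:
utt_0001 /path/to/fbank/raw_fbank_train.1.ark:16
utt_0002 /path/to/fbank/raw_fbank_train.1.ark:23417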
The next step is to create the myfeats and data-myfeats directories. These directories will follow the same structure as the fbank and data-fbank directories, but will contain our custom features. Notice that the data folder will have the same files except feats.scp: feats.scp has to be regenerated so that it tells kaldi to read features from the myfeats folder. The steps are summarized below:
$ mkdir myfeats data-myfeats
$ cp -r data-fbank/* data-myfeats
$ rm data-myfeats/*/feats.scp # we will need to generate new feats.scp files for the custom feats
Now the pre-requisites are done. Next, we can use either MATLAB or python to prepare the custom features and generate the corresponding ark and scp files.
If you prefer python, you can jump ahead to the Custom features using Python section below.
Custom features using MATLAB
Since kaldi makes use of a unique ID to read the ark files, we will need to follow the same format. As mentioned before, we will use the already prepared FBANK features as a reference file to generate our custom features. In order to do this with MATLAB, we will need to read and write the ark files using MATLAB.
The research page of Hynek Boril offers MATLAB routines for reading and writing ark files. The code can be found here. It offers three MATLAB files:
- arkread.m : For reading ark files
- arkwrite.m : For writing ark files
- ark2scp.m : For generating scp files from ark files
The general procedure is as follows: use arkread.m to read an ark file, which returns a header matrix and a feature matrix. The header matrix contains the unique id of each utterance and its feature size related information, while the feature matrix is one large matrix containing all the feature vectors. The idea is to replace the feature matrix with our custom features, update the header matrix accordingly, and write both back using arkwrite.m. Then we use ark2scp.m to generate the corresponding .scp file.
A sample MATLAB code is given below.
codedir = pwd; % base directory containing the fbank and data-fbank folders
setlist = {'test', 'train', 'dev'};
myfeatname = 'myfeats';
featdim = 64; % insert your feature dimension here
splits = 10; % the feature extraction is typically split into 10, change this otherwise
splitstart = 1; % increase this to resume from a later split
fbankdatadir = fullfile(codedir, 'data-fbank');
fbankfeatdir = fullfile(codedir, 'fbank');
datadir = fullfile(codedir, ['data-', myfeatname]);
featdir = fullfile(codedir, myfeatname);
% create dirs
system(['mkdir -p ' datadir ' ' featdir])
for setnum = 1:length(setlist)
setname = setlist{setnum};
% copy files
system(['cp -r ' fbankdatadir '/' setname ' ' datadir '/'])
filelist = fullfile(datadir, setname, 'wav.scp');
fid = fopen(filelist, 'r');
filenames = textscan (fid, '%s %s');
fclose(fid);
numfiles = length(filenames{1});
fileindex = 1;
for splitindex = 1: splits
disp (['Processing ' num2str(splitindex) ' of ' num2str(splits) ' split data.']);
srcark = strcat('raw_fbank_', setname, '.', num2str(splitindex), '.ark');
srcark_filename = fullfile(fbankfeatdir, srcark);
destark = strcat('raw_', myfeatname, '_', setname, '.', num2str(splitindex), '.ark');
destscp = strcat('raw_', myfeatname, '_', setname, '.', num2str(splitindex), '.scp');
destark_filename = fullfile(featdir, destark);
destscp_filename = fullfile(featdir, destscp);
[HEADER_MAT, FEATURE_MAT] = arkread(srcark_filename);
HEADER_MAT_NEW = cell(size(HEADER_MAT,1), 5);
FEATURE_MAT_NEW = zeros(size(FEATURE_MAT,1), featdim); % same total number of frames as FBANK, but with the new feature dimension
framestart = 1;
if lt(splitindex, splitstart)
% skip splits that were already processed, keeping the file index in sync
fileindex = fileindex + size(HEADER_MAT, 1);
continue;
end
for filenum = 1 : size(HEADER_MAT, 1)
disp(['File index is ' num2str(fileindex) ' of split set ' num2str(splitindex)])
% verify the id
if ~strcmp(HEADER_MAT{filenum,1}, filenames{1}(fileindex))
error ('ID mismatch!')
end
% read the audio file, the filename is typically the second column in wav.scp
% else, find the right column index in wav.scp and change the cell column below
[x, fs] = audioread(char(filenames{2}(fileindex)));
% Extract custom features here
newfeat = extract_my_feats(x);
% expected shape of newfeat is (numframes x featdim)
% check if it has the same number of frames as in FBANK
% this is an error only if we need to reuse existing alignments
% if you don't need the existing alignments,
% comment out this check
if ne(HEADER_MAT{filenum,2}, size(newfeat,1))
error ('Dimension mismatch!')
end
frameend = framestart + size(newfeat,1) - 1;
% fill in header mat details
HEADER_MAT_NEW{filenum,1} = HEADER_MAT{filenum,1};
HEADER_MAT_NEW{filenum,2} = size(newfeat,1);
HEADER_MAT_NEW{filenum, 3} = featdim;
HEADER_MAT_NEW{filenum, 4} = framestart;
HEADER_MAT_NEW{filenum, 5} = frameend;
% add features to featuremat
FEATURE_MAT_NEW(framestart:frameend,:) = newfeat;
fileindex = fileindex + 1;
framestart = frameend + 1;
end
% write the ark file
disp ('Writing ark files...')
arkwrite(destark_filename, HEADER_MAT_NEW, FEATURE_MAT_NEW);
ark2scp(destark_filename);
end % splitindex
end % set num
Now the features are successfully stored in ark files and the corresponding scp files have been generated. Next we have to generate the feats.scp file by concatenating the scp files, which are split into 10 subsets. You can do it in a terminal using:
nj=10
for x in test train dev; do
  rm -f data-myfeats/$x/feats.scp
  for n in $(seq $nj); do
    cat myfeats/raw_myfeats_${x}.$n.scp || exit 1;
  done > data-myfeats/$x/feats.scp
done
Now you can point the data directory in the training scripts to data-myfeats to train and decode with kaldi using your custom features.
Custom features using Python
As mentioned in the section above, since kaldi makes use of a unique ID to read the ark files, we will need to follow the same format. For this we will use the already prepared FBANK features as a reference to generate our custom features. In order to do this with python, we need python routines to read and write ark files.
kaldiio is a pure python module for reading and writing kaldi ark files. You can install it in your machine/environment using pip:
pip install kaldiio
We will make use of the load_ark and save_ark functions. The general procedure is as follows: use load_ark to read an ark file, which returns the unique id and the corresponding feature matrix for each utterance. The idea is to replace the feature matrix with our custom features, pair it with the same unique id, and write it using save_ark. We will also use save_ark to generate the corresponding .scp file.
A sample python code is given below.
import os
import shutil

import kaldiio
import numpy as np
from scipy.io import wavfile


def copy_and_overwrite(from_path, to_path):
    if os.path.exists(to_path):
        shutil.rmtree(to_path)
    shutil.copytree(from_path, to_path)


basedir = os.getcwd()
setlist = ['test', 'train', 'dev']
featdim = 64  # insert your feature dim here
tags = 'myfeats'
splits = 10

fbankdatadir = os.path.join(basedir, 'data-fbank')
fbankfeatdir = os.path.join(basedir, 'fbank')
datadir = os.path.join(basedir, 'data-' + tags)
featdir = os.path.join(basedir, tags)

# create dirs if they dont exist
if not os.path.isdir(datadir):
    os.makedirs(datadir)
if not os.path.isdir(featdir):
    os.makedirs(featdir)

for setname in setlist:
    print('Processing ' + setname + ' data.')
    # copy files
    src_dir = os.path.join(fbankdatadir, setname)
    dst_dir = os.path.join(datadir, setname)
    copy_and_overwrite(src_dir, dst_dir)
    filelist = os.path.join(datadir, setname, 'wav.scp')
    filenames = []
    keys_orig = []
    for line in open(filelist):
        filenames.append(line.split()[4])  # wav path is in the 5th column here; use index 1 for a plain two-column wav.scp
        keys_orig.append(line.split()[0])  # key or id is in the first column
    numfiles = len(filenames)
    fileindex = 0
    for splitindex in range(splits):
        print('Processing ' + str(splitindex + 1) + ' of ' + str(splits) + ' split data.')
        srcark = 'raw_fbank_' + setname + '.' + str(splitindex + 1) + '.ark'
        destark = 'raw_' + tags + '_' + setname + '.' + str(splitindex + 1) + '.ark'
        destscp = 'raw_' + tags + '_' + setname + '.' + str(splitindex + 1) + '.scp'
        srcark_filename = os.path.join(fbankfeatdir, srcark)
        destark_filename = os.path.join(featdir, destark)
        destscp_filename = os.path.join(featdir, destscp)
        d = kaldiio.load_ark(srcark_filename)
        write_dict = {}  # kaldiio uses features in the form of a dict
        for key, array in d:
            # check if keys match
            if keys_orig[fileindex] != key:
                raise ValueError('ID Mismatch!')
            fname = filenames[fileindex]
            # read audio, assuming wav file
            # change this if the file is of some other type
            fs, sig = wavfile.read(fname)
            # Do your feature extraction below:
            # expected shape is (numframes x featdim)
            newfeat = extract_my_feats(sig)
            # do the numframes check, if we are reusing existing alignments
            # else, comment out this check
            if array.shape[0] != newfeat.shape[0]:
                raise ValueError('Dimension mismatch!')
            # append the features into the dictionary
            write_dict[key] = newfeat.astype(np.float32)  # let us store it as float32
            fileindex += 1
        # write features to disk
        print("Writing to " + destark_filename)
        kaldiio.save_ark(destark_filename, write_dict, scp=destscp_filename)
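The extract_my_feats call above is only a placeholder for your own feature extraction and is not part of kaldiio. As a minimal sketch (assuming 16 kHz mono wav input, the hypothetical function name extract_my_feats, and the featdim used above), a framewise log-spectrum extractor could look like this:

import numpy as np

def extract_my_feats(sig, fs=16000, featdim=64, frame_len=0.025, frame_shift=0.010):
    # hypothetical example: per-frame log magnitude spectrum truncated to featdim bins
    win = int(frame_len * fs)
    hop = int(frame_shift * fs)
    numframes = 1 + (len(sig) - win) // hop  # kaldi-style framing (snip-edges)
    feats = np.zeros((numframes, featdim), dtype=np.float32)
    window = np.hamming(win)
    for t in range(numframes):
        frame = sig[t * hop: t * hop + win].astype(np.float64)
        spec = np.abs(np.fft.rfft(frame * window))
        feats[t, :] = np.log(spec[:featdim] + 1e-10)
    return feats

If you plan to reuse existing alignments, the framing here must produce the same number of frames as your FBANK configuration (typically 25 ms windows with a 10 ms shift), otherwise the frame-count check in the loop above will fail.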
Notice that kaldiio.save_ark with the scp argument set to a destination filename also generates the scp file together with the ark write. We next have to generate the feats.scp file by concatenating the scp files, which are split into 10 subsets. You can do it in a terminal using:
nj=10
for x in test train dev; do
  rm -f data-myfeats/$x/feats.scp
  for n in $(seq $nj); do
    cat myfeats/raw_myfeats_${x}.$n.scp || exit 1;
  done > data-myfeats/$x/feats.scp
done
Now you can point the data directory in the training scripts to data-myfeats to train and decode with kaldi using your custom features.
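As a quick sanity check (a sketch assuming the data-myfeats/train/feats.scp generated above), you can read a few utterances back through kaldiio and confirm that the feature shapes match what you expect:

import kaldiio

feats = kaldiio.load_scp('data-myfeats/train/feats.scp')  # lazy dict: utterance id -> feature matrix
for i, key in enumerate(feats):
    print(key, feats[key].shape, feats[key].dtype)
    if i == 2:
        break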