Prerequisites

Data preprocessing

The DeepBiome package takes microbiome abundance data as input and uses the phylogenetic taxonomy to guide the choice of the optimal number of layers and neurons in the deep learning architecture.

To use DeepBiome, you can run (1) `k` times repetition or (2) `k` fold cross-validation experiments. For each experiment, we assume that the dataset is given by

  1. a list of `k` input files for `k` times repetition, or

  2. one input file for `k` fold cross-validation.

With a list of k inputs for k times repetition

DeepBiome needs 4 data files as follows:

  1. the tree information

  2. the list of the input files (each file has all samples’ information for one repetition)

  3. the list of the names of input files

  4. the y file (outcomes)

For k times repetition, we can use the list of k input files. Each file has all samples’ information for one repetition. In addition, we can set the training index for each repetition. If we provide the index file, DeepBiome builds the training set for each repetition based on the corresponding fold index in the index file. If not, DeepBiome will generate the index file locally.

Each file should be in the CSV format as follows:

tree information (.csv)

A file with the phylogenetic tree information. Below is an example of the phylogenetic tree information dictionary.

Example of genus dictionary

Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Tropheryma,Cellulomonadaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Flavobacterium,Flavobacteriaceae,Flavobacteriales,Flavobacteria,Bacteroidetes,Bacteria
Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Porphyromonas,Porphyromonadaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Parvimonas,Clostridiales_Incertae_Sedis_XI,Clostridiales,Clostridia,Firmicutes,Bacteria
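A quick way to sanity-check such a dictionary file is to load it with pandas (a sketch under the assumption that pandas is available; the file content is inlined here via io.StringIO instead of read from disk):

```python
import io
import pandas as pd

# Two rows of the genus dictionary above, inlined for the sketch.
tree_csv = io.StringIO(
    "Genus,Family,Order,Class,Phylum,Domain\n"
    "Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria\n"
    "Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria\n"
)
tree = pd.read_csv(tree_csv)

# Each row maps one genus up through the six taxonomic ranks.
print(list(tree.columns))
```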

list of the names of k input files (.csv)

If we want to use the list of the input files, we need to make a list of the names of each input file. Below is an example file for k=4 repetitions.

Example of the list of k input file names

gcount_0001.csv
gcount_0002.csv
gcount_0003.csv
gcount_0004.csv

list of k input files

Each file should have each repetition’s sample microbiome abundance. Below is an example file for k=4 repetitions. This example is gcount_0001.csv, the first repetition in the list of input file names above. This file has 4 samples’ microbiome abundance.

Example of one input file (.csv) of the k inputs

Streptococcus,Tropheryma,Veillonella,Actinomyces,Flavobacterium,Prevotella,Porphyromonas,Parvimonas,Fusobacterium,Propionibacterium,Gemella,Rothia,Granulicatella,Neisseria,Lactobacillus,Megasphaera,Catonella,Atopobium,Campylobacter,Capnocytophaga,Solobacterium,Moryella,TM7_genera_incertae_sedis,Staphylococcus,Filifactor,Oribacterium,Burkholderia,Sneathia,Treponema,Moraxella,Haemophilus,Selenomonas,Corynebacterium,Rhizobium,Bradyrhizobium,Methylobacterium,OD1_genera_incertae_sedis,Finegoldia,Microbacterium,Sphingomonas,Chryseobacterium,Bacteroides,Bdellovibrio,Streptophyta,Lachnospiracea_incertae_sedis,Paracoccus,Fastidiosipila,Pseudonocardia
841,0,813,505,5,3224,0,362,11,65,156,1,55,0,1,20,382,1,333,24,80,43,309,2,3,4,0,1,32,0,2,4,382,0,0,96,23,0,0,87,0,0,0,0,0,0,0,2133
1445,0,1,573,0,1278,82,85,69,154,436,3,0,61,440,0,394,83,33,123,0,49,414,0,0,37,0,0,42,0,0,384,27,0,0,0,146,0,0,1,2,0,0,0,0,0,0,3638
1259,0,805,650,0,1088,0,0,74,0,155,228,430,765,0,0,11,102,68,90,77,83,322,10,0,7,0,122,76,0,1,25,0,0,0,44,13,0,0,2,8,1,39,0,0,0,0,3445
982,0,327,594,0,960,81,19,9,0,45,457,1049,0,3,450,19,170,388,147,0,0,41,63,0,1,0,0,121,0,0,1,0,0,0,0,344,0,157,1,0,4,60,0,0,0,0,3507

y (.csv)

One column contains the y samples for one repetition. Below is an example file for k=4 repetitions; each column has the outputs of the 4 samples for one repetition.

Example of y file (.csv)

V1,V2,V3,V4
1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0
0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0

index for training set for each repetition (.csv)

For each repetition, we have to set the training and test sets. If the index file is given, DeepBiome sets the training and test sets based on it. Each column has the training indices for one repetition; DeepBiome will use only the samples in this index set for training. Below is an example of the index file for k=4 repetitions.

Example of index file (.csv)

V1,V2,V3,V4
0,1,2,3
1,2,3,0
2,3,0,1

In the example above, we used the first 3 rows of the first column in y.csv for the training set in the first repetition.
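If no index file is supplied, DeepBiome generates one itself. As an illustration (not DeepBiome's own code), the sketch below builds a compatible idx.csv by hand, assuming the 4-sample, 3-training-index layout of the example above:

```python
import numpy as np
import pandas as pd

n_samples, n_repetitions, n_train = 4, 4, 3
rng = np.random.default_rng(0)

# One column of training indices per repetition, matching the idx.csv layout above.
idx = pd.DataFrame({
    f"V{r + 1}": np.sort(rng.choice(n_samples, size=n_train, replace=False))
    for r in range(n_repetitions)
})
idx.to_csv("idx.csv", index=False)
```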

With one input file for k fold cross-validation

DeepBiome needs 3 data files as follows:

  1. the tree information

  2. the input file

  3. y

For k fold cross-validation, we use one input file. In addition, we can set the training index for each fold. If we provide the index file, DeepBiome builds the training set for each fold based on the corresponding fold index in the index file. If not, DeepBiome will generate the index file locally.

Each file should be in the CSV format as follows:

tree information (.csv)

A file with the phylogenetic tree information. Below is an example of the phylogenetic tree information dictionary.

Example of genus dictionary

Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Tropheryma,Cellulomonadaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Flavobacterium,Flavobacteriaceae,Flavobacteriales,Flavobacteria,Bacteroidetes,Bacteria
Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Porphyromonas,Porphyromonadaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Parvimonas,Clostridiales_Incertae_Sedis_XI,Clostridiales,Clostridia,Firmicutes,Bacteria

input file

The input file has the microbiome abundance of each sample. Below is an example file with 4 samples’ microbiome abundance.

Example of input file (.csv)

Streptococcus,Tropheryma,Veillonella,Actinomyces,Flavobacterium,Prevotella,Porphyromonas,Parvimonas,Fusobacterium,Propionibacterium,Gemella,Rothia,Granulicatella,Neisseria,Lactobacillus,Megasphaera,Catonella,Atopobium,Campylobacter,Capnocytophaga,Solobacterium,Moryella,TM7_genera_incertae_sedis,Staphylococcus,Filifactor,Oribacterium,Burkholderia,Sneathia,Treponema,Moraxella,Haemophilus,Selenomonas,Corynebacterium,Rhizobium,Bradyrhizobium,Methylobacterium,OD1_genera_incertae_sedis,Microbacterium,Sphingomonas,Chryseobacterium,Bdellovibrio,Streptophyta,Finegoldia,Bacteroides,Lachnospiracea_incertae_sedis,Paracoccus,Fastidiosipila,Pseudonocardia
6244.0,0.0,2985.0,204.0,5.0,3548.0,308.0,53.0,506.0,3.0,324.0,669.0,795.0,0.0,3686.0,6.0,41.0,609.0,11.0,0.0,3.0,0.0,0.0,31.0,0.0,0.0,3.0,0.0,93.0,0.0,1.0,9.0,0.0,0.0,0.0,3.0,2.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
3573.0,3.0,2566.0,975.0,0.0,7195.0,377.0,39.0,442.0,69.0,602.0,45.0,527.0,4536.0,324.0,105.0,626.0,400.0,1130.0,1132.0,127.0,9.0,426.0,16.0,0.0,150.0,1.0,244.0,248.0,0.0,19.0,26.0,9.0,0.0,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
516.0,0.0,227.0,39.0,0.0,180.0,54.0,38.0,80.0,2.0,84.0,26.0,18.0,0.0,2235.0,4.0,3.0,3.0,8.0,2.0,10.0,2.0,7.0,0.0,0.0,0.0,0.0,78.0,7.0,0.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
881.0,1.0,47.0,4.0,0.0,13.0,5.0,1.0,108.0,1.0,7.0,6.0,6.0,117.0,0.0,0.0,2.0,0.0,4.0,6.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,3.0,6.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0

y (.csv)

Below is an example file of the outputs of 4 samples.

Example of y file (.csv)

y
0.78
1.01
0.92
0.91

index for training set for each fold (.csv)

For each fold, we have to set the training and test sets. If the index file is given, DeepBiome sets the training and test sets based on it. Each column has the training indices for one fold; DeepBiome will use only the samples in this index set for training. Below is an example of the index file for k=4 folds.

Example of index file (.csv)

V1,V2,V3,V4
0,1,2,3
1,2,3,0
2,3,0,1

In the example above, we used the first 3 rows of the first column in y.csv for the training set in the first fold.
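If no index file is supplied, DeepBiome generates one itself. As an illustration (not DeepBiome's own code), a standard k-fold split can be turned into the same V1..V4 layout, assuming 4 samples and k=4 folds as in the example above:

```python
import numpy as np
import pandas as pd

n_samples, k = 4, 4
rng = np.random.default_rng(0)

# Shuffle the sample indices and split them into k disjoint held-out folds.
folds = np.array_split(rng.permutation(n_samples), k)

# Column V{i} holds the training indices for fold i: everything outside fold i.
idx = pd.DataFrame({
    f"V{i + 1}": np.concatenate([f for j, f in enumerate(folds) if j != i])
    for i in range(k)
})
```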

Configuration

For detailed configuration, we use a Python dictionary as input to the main training function.

Preparing the configuration about the network information (network_info)

To give the information about the training hyper-parameters, we provide a dictionary of configurations to the network_info field. Alternatively, we can use a configuration file (.cfg).

Configuration for the network training should include the information about:

model_info

about the training method and metrics

architecture_info

about the architecture options

training_info

about the hyper-parameters for training (not required for testing and prediction)

validation_info

about the hyper-parameters for validation (not required for testing and prediction)

test_info

about the hyper-parameters for testing

Note

You do not have to set an option if it has a default value.

network_info[‘model_info’]

Detailed options for the model_info field are as follows.

network_class

DeepBiome network class (default=’DeepBiomeNetwork’).

reader_class

reader classes

possible options:

  “MicroBiomeRegressionReader”: microbiome abundance data reader for regression problems
  “MicroBiomeClassificationReader”: microbiome abundance data reader for classification problems

optimizer

optimization method for training the network. We use the optimizers implemented in Keras (see Optimizer).

possible options:

  “adam”: Adam optimizer
  “sgd”: stochastic gradient descent optimizer

lr

learning rate for the optimizer (float between 0 and 1)

decay

learning rate decay ratio for the optimizer (float between 0 and 1)

loss

loss function for training the network

possible options:

  “mean_squared_error”: for regression problems
  “binary_crossentropy”: for binary classification problems
  “categorical_crossentropy”: for multi-class classification problems

metrics

additional metrics to check the model performance

possible options:

  “correlation_coefficient”: Pearson correlation coefficient (-1 ~ 1)
  “binary_accuracy”: accuracy for binary classification problems (0 ~ 1)
  “categorical_accuracy”: accuracy for multi-class classification problems (0 ~ 1)
  “sensitivity”: sensitivity (0 ~ 1)
  “specificity”: specificity (0 ~ 1)
  “gmeasure”: (sensitivity * specificity) ^ 0.5 (0 ~ 1)
  “auc”: area under the receiver operating characteristic curve (0 ~ 1)
  “precision”: precision (0 ~ 1)
  “recall”: recall (0 ~ 1)
  “f1”: F1 score (0 ~ 1)
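For instance, the gmeasure metric above is the geometric mean of sensitivity and specificity (the numbers here are made up for illustration):

```python
import math

sensitivity, specificity = 0.90, 0.64  # hypothetical values
gmeasure = math.sqrt(sensitivity * specificity)  # (0.90 * 0.64) ** 0.5
print(round(gmeasure, 3))  # 0.759
```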

taxa_selection_metrics

metrics for the taxa selection performance

possible options:

  “accuracy”: accuracy (0 ~ 1)
  “sensitivity”: sensitivity (0 ~ 1)
  “specificity”: specificity (0 ~ 1)
  “gmeasure”: (sensitivity * specificity) ^ 0.5 (0 ~ 1)

normalizer

normalizer for the input data (default=`normalize_minmax`)

network_info[‘architecture_info’]

Detailed options for the architecture_info field are as follows.

Combinations of the options below give the network training methods DNN, DNN+L1, and DeepBiome in the reference (url TBD).

weight_initial

network weight initialization

possible options:

  “glorot_uniform”: Glorot uniform initializer (default)
  “he_normal”: He normal initializer
  “phylogenetic_tree”: weights within the tree connections: 1; weights outside the tree connections: 0
  “phylogenetic_tree_glorot_uniform”: weights within the tree connections: glorot_uniform; weights outside the tree connections: 0
  “phylogenetic_tree_he_normal”: weights within the tree connections: he_normal; weights outside the tree connections: 0

weight_l1_penalty

\(\lambda\) for the l1 penalty (float; default = 0)

weight_l2_penalty

\(\lambda\) for the l2 penalty (float; default = 0)

weight_decay

DeepBiome weight decay method based on the phylogenetic tree (default = “”: no DeepBiome weight decay)

possible options:

  “phylogenetic_tree”: weight decay based on the phylogenetic tree information, with a small amount of noise (\(\epsilon \le 1e-2\)) outside the tree
  “phylogenetic_tree_wo_noise”: weight decay based on the phylogenetic tree information, without any noise outside the tree

batch_normalization

option for adding batch normalization to each convolutional layer (default = False)

drop_out

option for adding dropout to each convolutional layer with the given ratio (default = 0)

Hint

Examples of the combinations of options used in the reference paper (url TBD):

  DNN: “weight_initial”=”glorot_uniform”
  DNN+L1: “weight_initial”=”glorot_uniform”, “weight_l1_penalty”=”0.01”
  DeepBiome: “weight_initial”=”glorot_uniform”, “weight_decay”=”phylogenetic_tree”

network_info[‘training_info’]

Detailed options for the training_info field are as follows.

epochs

number of epochs for training (integer)

batch_size

batch size for each mini-batch (integer)

callbacks

callback classes implemented in Keras (see Callbacks)

possible options:

  “ModelCheckpoint”: save the best model weights based on the monitor (see ModelCheckpoint)
  “EarlyStopping”: stop training before the given number of epochs based on the monitor (see EarlyStopping)

monitor

monitored value for the ModelCheckpoint and EarlyStopping callbacks (e.g. val_loss, val_accuracy)

mode

how to use the monitored value for the ModelCheckpoint and EarlyStopping callbacks

possible options:

  “min”: for example, when using the monitor val_loss
  “max”: for example, when using the monitor val_accuracy

patience

patience for the EarlyStopping callback (integer; default = 20)

min_delta

the minimum threshold for the ModelCheckpoint, EarlyStopping callbacks (float; default = 1e-4)

network_info[‘validation_info’]

Detailed options for the validation_info field are as follows.

validation_size

the ratio of the number of samples in the validation set to the number of samples in the training set (e.g. “0.2”)

batch_size

the batch size for each mini-batch. If “None”, all samples are used as one mini-batch. (default = “None”)

network_info[‘test_info’]

Detailed options for the test_info field are as follows.

batch_size

the batch size for each mini-batch. If “None”, all samples are used as one mini-batch. (default = “None”)

Example for the network_info

This is an example of the network_info configuration dictionary:

network_info = {
    'architecture_info': {
        'batch_normalization': 'False',
        'drop_out': '0',
        'weight_initial': 'glorot_uniform',
        'weight_l1_penalty':'0.01',
        'weight_decay': 'phylogenetic_tree',
    },
    'model_info': {
        'decay': '0.001',
        'loss': 'binary_crossentropy',
        'lr': '0.01',
        'metrics': 'binary_accuracy, sensitivity, specificity, gmeasure, auc',
        'network_class': 'DeepBiomeNetwork',
        'normalizer': 'normalize_minmax',
        'optimizer': 'adam',
        'reader_class': 'MicroBiomeClassificationReader',
        'taxa_selection_metrics': 'accuracy, sensitivity, specificity, gmeasure'
    },
    'training_info': {
        'batch_size': '200',
        'epochs': '10',
        'callbacks': 'ModelCheckpoint',
        'monitor': 'val_binary_accuracy',
        'mode': 'max',
        'min_delta': '1e-4',
    },
    'validation_info': {
        'batch_size': 'None', 'validation_size': '0.2'
    },
    'test_info': {
        'batch_size': 'None'
    }
}

This is an example of the configuration file network_info.cfg:

[model_info]
network_class = DeepBiomeNetwork
optimizer   = adam
lr          = 0.01
decay       = 0.0001
loss        = binary_crossentropy
metrics     = binary_accuracy, sensitivity, specificity, gmeasure, auc
taxa_selection_metrics = accuracy, sensitivity, specificity, gmeasure
reader_class = MicroBiomeClassificationReader
normalizer  = normalize_minmax

[architecture_info]
weight_initial = glorot_uniform
weight_decay = phylogenetic_tree
batch_normalization = False
drop_out = 0

[training_info]
epochs          = 1000
batch_size      = 200
callbacks       = ModelCheckpoint
monitor         = val_binary_accuracy
mode            = max
min_delta       = 1e-4

[validation_info]
validation_size = 0.2
batch_size = None

[test_info]
batch_size = None
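The .cfg layout maps directly onto the nested dictionary form. A minimal sketch of that mapping using Python's standard configparser (the file content is inlined here for illustration; DeepBiome's own config loader may differ):

```python
import configparser

# A few settings in the same [section] layout as network_info.cfg above.
cfg_text = """
[model_info]
network_class = DeepBiomeNetwork
optimizer = adam
lr = 0.01

[training_info]
epochs = 1000
batch_size = 200
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)

# Convert to the nested-dictionary form; note all values stay strings.
network_info = {section: dict(parser[section]) for section in parser.sections()}
print(network_info["model_info"]["lr"])  # '0.01'
```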

Hint

See the Examples section for configuration file examples for various problems.

Preparing the configuration about the path information (path_info)

To give the information about the paths to the dataset and the paths for saving the trained weights and the evaluation results, we provide a dictionary of configurations to the path_info field. Alternatively, we can use a configuration file (.cfg).

Your configuration for the paths should include the information about:

data_info

about the path information of the dataset

model_info

about the path information for saving the trained weights and the evaluation results

Note

All paths are relative to the directory where the code runs.

path_info[‘data_info’]

To provide the dictionary as input, we can use the options below:

tree_info_path

tree information file (.csv)

count_list_path

list of the names of the input files (.csv)

count_path

directory path of the input files

y_path

y path (.csv) (not required for prediction)

idx_path

index file path for the repetitions (.csv)

data_path

directory path of the index and y file

To use one input file, we can use the options below:

tree_info_path

tree information file (.csv)

x_path

input path (.csv)

y_path

y path (.csv) (not required for prediction)

data_path

directory path of the index, x and y file

path_info[‘model_info’]

weight

weight file name (.h5)

evaluation

evaluation file name (.npy) (not required for prediction)

model_dir

base directory path for the model (weight, evaluation)

history

history file name for the history values of each evaluation metric from the training (.json). If not set, DeepBiome will not save the training history.

Warning

If you want to use sub-directories in the paths (for example, “weight”=”weight/weight.h5”, “history”=”history/hist.json”, “model_dir”=”./”), you have to create the sub-directories “./weight” and “./history” before running the code.
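Those sub-directories can be created up front with the standard library (a sketch following the warning's example layout):

```python
import os

model_dir = "./"  # base directory from the warning's example
for sub in ("weight", "history"):
    # exist_ok=True makes this safe to run repeatedly.
    os.makedirs(os.path.join(model_dir, sub), exist_ok=True)
```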

Example for the path_info for the list of inputs

This is an example of the path_info configuration dictionary:

path_info = {
    'data_info': {
        'count_list_path': 'data/simulation/gcount_list.csv',
        'count_path': 'data/simulation/count',
        'data_path': 'data/simulation/s2/',
        'idx_path': 'data/simulation/s2/idx.csv',
        'tree_info_path': 'data/genus48/genus48_dic.csv',
        'x_path': '',
        'y_path': 'y.csv'
    },
    'model_info': {
        'model_dir': './simulation_s2/simulation_s2_deepbiome/',
        'weight': 'weight/weight.h5',
        'history': 'hist.json',
        'evaluation': 'eval.npy'
    }
}

This is an example of the configuration file path_info.cfg:

[data_info]
data_path = data/simulation/s2/
tree_info_path = data/genus48/genus48_dic.csv
idx_path = data/simulation/s2/idx.csv
count_list_path = data/simulation/gcount_list.csv
count_path = data/simulation/count
y_path = y.csv

[model_info]
model_dir = ./simulation_s2/simulation_s2_deepbiome/
weight = weight/weight.h5
history = history/hist.json
evaluation = eval.npy

Example for the path_info for the one input file

This is an example of the path_info configuration dictionary:

path_info = {
    'data_info': {
        'data_path': '../../data/pulmonary/',
        'tree_info_path': '../../data/genus48/genus48_dic.csv',
        'x_path': 'X.csv',
        'y_path': 'y.csv'
    },
    'model_info': {
        'model_dir': './',
        'weight': 'weight/weight.h5',
        'history':'history/hist.json',
        'evaluation': 'eval.npy',
    }
}

This is an example of the configuration file path_info.cfg:

[data_info]
data_path = ../../data/pulmonary/
tree_info_path = ../../data/genus48/genus48_dic.csv
x_path = X.csv
y_path = y.csv

[model_info]
model_dir = ./
weight = weight/weight.h5
history = history/hist.json
evaluation = eval.npy

Hint

See the Examples section for configuration file examples for various problems.