Prerequisites¶
Data preprocessing¶
The DeepBiome package takes microbiome abundance data as input and uses the phylogenetic taxonomy to guide the choice of the number of layers and neurons in the deep learning architecture.
To use DeepBiome, you can run experiments with (1) k-times repetition or (2) k-fold cross-validation. For each design, we assume that the dataset is given as:
a list of k input files for k-times repetition, or
one input file for k-fold cross-validation.
With a list of k inputs for k-times repetition¶
DeepBiome needs 4 data files as follows:
the tree information
the list of k input files (each file holds all samples' information for one repetition)
the list of the names of the input files
y
For k-times repetition, we use the list of k input files; each file holds all samples' information for one repetition. In addition, we can provide a training index for each repetition. If an index file is given, DeepBiome builds the training set for each repetition from the corresponding column of indices in the index file. If not, DeepBiome generates the index file itself.
Each file should be in CSV format, as follows:
- tree information (.csv)
A file describing the phylogenetic tree. Below is an example of the phylogenetic tree information dictionary:
Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Tropheryma,Cellulomonadaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Flavobacterium,Flavobacteriaceae,Flavobacteriales,Flavobacteria,Bacteroidetes,Bacteria
Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Porphyromonas,Porphyromonadaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Parvimonas,Clostridiales_Incertae_Sedis_XI,Clostridiales,Clostridia,Firmicutes,Bacteria
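Since DeepBiome sizes its layers from this taxonomy, it can help to check how many distinct taxa each level contains. A minimal sketch in Python (the two-row inlined sample and the counting logic are illustrative, not DeepBiome's internal code):

```python
import csv
import io

# Two rows of the tree-information CSV above, inlined for illustration;
# in practice this would be read from the .csv file.
tree_csv = """Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
"""

reader = csv.DictReader(io.StringIO(tree_csv))
rows = list(reader)

# Distinct taxa per level: counts like these guide the number of
# neurons in each hidden layer of the architecture.
level_sizes = {level: len({row[level] for row in rows})
               for level in reader.fieldnames}
print(level_sizes)
```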
- list of the names of k input files (.csv)
To use the list of input files, we need to make a file listing the name of each input file. Below is an example for k=4 repetitions:
gcount_0001.csv
gcount_0002.csv
gcount_0003.csv
gcount_0004.csv
- list of k input files
Each file should contain the sample microbiome abundances for one repetition. Below is an example for k=4 repetitions: gcount_0001.csv, the first file in the list of input file names above, with 4 samples' microbiome abundances.
Streptococcus,Tropheryma,Veillonella,Actinomyces,Flavobacterium,Prevotella,Porphyromonas,Parvimonas,Fusobacterium,Propionibacterium,Gemella,Rothia,Granulicatella,Neisseria,Lactobacillus,Megasphaera,Catonella,Atopobium,Campylobacter,Capnocytophaga,Solobacterium,Moryella,TM7_genera_incertae_sedis,Staphylococcus,Filifactor,Oribacterium,Burkholderia,Sneathia,Treponema,Moraxella,Haemophilus,Selenomonas,Corynebacterium,Rhizobium,Bradyrhizobium,Methylobacterium,OD1_genera_incertae_sedis,Finegoldia,Microbacterium,Sphingomonas,Chryseobacterium,Bacteroides,Bdellovibrio,Streptophyta,Lachnospiracea_incertae_sedis,Paracoccus,Fastidiosipila,Pseudonocardia
841,0,813,505,5,3224,0,362,11,65,156,1,55,0,1,20,382,1,333,24,80,43,309,2,3,4,0,1,32,0,2,4,382,0,0,96,23,0,0,87,0,0,0,0,0,0,0,2133
1445,0,1,573,0,1278,82,85,69,154,436,3,0,61,440,0,394,83,33,123,0,49,414,0,0,37,0,0,42,0,0,384,27,0,0,0,146,0,0,1,2,0,0,0,0,0,0,3638
1259,0,805,650,0,1088,0,0,74,0,155,228,430,765,0,0,11,102,68,90,77,83,322,10,0,7,0,122,76,0,1,25,0,0,0,44,13,0,0,2,8,1,39,0,0,0,0,3445
982,0,327,594,0,960,81,19,9,0,45,457,1049,0,3,450,19,170,388,147,0,0,41,63,0,1,0,0,121,0,0,1,0,0,0,0,344,0,157,1,0,4,60,0,0,0,0,3507
- y (.csv)
Each column contains the y values for one repetition. Below is an example file for k=4 repetitions; each column holds the outputs of that repetition's 4 samples.
V1,V2,V3,V4
1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0
0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0
- index for training set for each repetition (.csv)
For each repetition, we have to split the samples into training and test sets. If an index file is given, DeepBiome builds the training and test sets from it. Each column holds the training indices for one repetition, and DeepBiome uses only the samples in that column for training. Below is an example of the index file for k=4 repetitions:
V1,V2,V3,V4
0,1,2,3
1,2,3,0
2,3,0,1
In the example above, the first column (0, 1, 2) selects the first 3 rows of the first column in y.csv as the training set for the first repetition.
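If no index file is supplied, DeepBiome generates one itself. A file of the same shape can also be produced by hand; the sketch below (sample counts and column names are assumptions matching the toy example above, not DeepBiome's internal code) draws a random training subset per repetition and prints it column-wise:

```python
import random

n_samples = 4   # samples per repetition (assumption from the example)
n_train = 3     # training-set size per repetition (assumption)
k = 4           # number of repetitions

random.seed(0)  # reproducible illustration
# One list of training indices per repetition, sampled without replacement
columns = [random.sample(range(n_samples), n_train) for _ in range(k)]

# Row j of the CSV holds the j-th training index of every repetition
lines = [",".join(f"V{i + 1}" for i in range(k))]
for j in range(n_train):
    lines.append(",".join(str(columns[i][j]) for i in range(k)))
print("\n".join(lines))
```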
With one input file for k-fold cross-validation¶
DeepBiome needs 3 data files as follows:
the tree information
the input file
y
For k-fold cross-validation, we use one input file. In addition, we can provide a training index for each fold. If an index file is given, DeepBiome builds the training set for each fold from the corresponding column of indices in the index file. If not, DeepBiome generates the index file itself.
Each file should be in CSV format, as follows:
- tree information (.csv)
A file describing the phylogenetic tree. Below is an example of the phylogenetic tree information dictionary:
Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Tropheryma,Cellulomonadaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Flavobacterium,Flavobacteriaceae,Flavobacteriales,Flavobacteria,Bacteroidetes,Bacteria
Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Porphyromonas,Porphyromonadaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Parvimonas,Clostridiales_Incertae_Sedis_XI,Clostridiales,Clostridia,Firmicutes,Bacteria
- input file
The input file holds the microbiome abundance of each sample. Below is an example file with 4 samples' microbiome abundances:
Streptococcus,Tropheryma,Veillonella,Actinomyces,Flavobacterium,Prevotella,Porphyromonas,Parvimonas,Fusobacterium,Propionibacterium,Gemella,Rothia,Granulicatella,Neisseria,Lactobacillus,Megasphaera,Catonella,Atopobium,Campylobacter,Capnocytophaga,Solobacterium,Moryella,TM7_genera_incertae_sedis,Staphylococcus,Filifactor,Oribacterium,Burkholderia,Sneathia,Treponema,Moraxella,Haemophilus,Selenomonas,Corynebacterium,Rhizobium,Bradyrhizobium,Methylobacterium,OD1_genera_incertae_sedis,Microbacterium,Sphingomonas,Chryseobacterium,Bdellovibrio,Streptophyta,Finegoldia,Bacteroides,Lachnospiracea_incertae_sedis,Paracoccus,Fastidiosipila,Pseudonocardia
6244.0,0.0,2985.0,204.0,5.0,3548.0,308.0,53.0,506.0,3.0,324.0,669.0,795.0,0.0,3686.0,6.0,41.0,609.0,11.0,0.0,3.0,0.0,0.0,31.0,0.0,0.0,3.0,0.0,93.0,0.0,1.0,9.0,0.0,0.0,0.0,3.0,2.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
3573.0,3.0,2566.0,975.0,0.0,7195.0,377.0,39.0,442.0,69.0,602.0,45.0,527.0,4536.0,324.0,105.0,626.0,400.0,1130.0,1132.0,127.0,9.0,426.0,16.0,0.0,150.0,1.0,244.0,248.0,0.0,19.0,26.0,9.0,0.0,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
516.0,0.0,227.0,39.0,0.0,180.0,54.0,38.0,80.0,2.0,84.0,26.0,18.0,0.0,2235.0,4.0,3.0,3.0,8.0,2.0,10.0,2.0,7.0,0.0,0.0,0.0,0.0,78.0,7.0,0.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
881.0,1.0,47.0,4.0,0.0,13.0,5.0,1.0,108.0,1.0,7.0,6.0,6.0,117.0,0.0,0.0,2.0,0.0,4.0,6.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,3.0,6.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
- y (.csv)
Below is an example file of the outputs of 4 samples.
y
0.78
1.01
0.92
0.91
- index for training set for each fold (.csv)
For each fold, we have to split the samples into training and test sets. If an index file is given, DeepBiome builds the training and test sets from it. Each column holds the training indices for one fold, and DeepBiome uses only the samples in that column for training. Below is an example of the index file for k=4 folds:
V1,V2,V3,V4
0,1,2,3
1,2,3,0
2,3,0,1
In the example above, the first column (0, 1, 2) selects the first 3 rows of y.csv as the training set for the first fold.
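If no index file is supplied, DeepBiome generates the folds itself. For reference, the column structure can be reproduced with plain k-fold logic; the sketch below (not DeepBiome's internal code; the 4-sample setup mirrors the toy example) assigns each sample to one held-out fold and takes the remaining samples as that fold's training set:

```python
n_samples, k = 4, 4  # assumptions matching the 4-sample example above

# Deterministic fold assignment: sample s is held out in fold s % k
folds = [list(range(i, n_samples, k)) for i in range(k)]

# Training set for fold i = all samples not held out in fold i
train_columns = [[s for s in range(n_samples) if s not in folds[i]]
                 for i in range(k)]

# Each column Vi lists the training indices for fold i, as in idx.csv
for i, col in enumerate(train_columns, start=1):
    print(f"V{i}: {col}")
```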
Configuration¶
For detailed configuration, we use a Python dictionary as the input to the main training function.
Preparing the configuration about the network information (network_info)¶
To provide the training hyper-parameters, we pass a configuration dictionary in the network_info field. Alternatively, we can use a configuration file (.cfg).
Configuration for the network training should include the information about:
- model_info
about the training method and metrics
- architecture_info
about the architecture options
- training_info
about the hyper-parameters for training (not required for testing and prediction)
- validation_info
about the hyper-parameters for validation (not required for testing and prediction)
- test_info
about the hyper-parameters for testing
Note
You don’t have to set an option if it has a default value.
network_info[‘model_info’]¶
Detailed options for the model_info field are as follows.
- network_class
DeepBiome network class (default=’DeepBiomeNetwork’).
- reader_class
reader classes
possible options
explanation
“MicroBiomeRegressionReader”
Microbiome abundance data reader for regression problems
“MicroBiomeClassificationReader”
Microbiome abundance data reader for classification problems
- optimizer
optimization methods for training the network. We used the optimizers implemented in Keras (See Optimizer).
possible options
explanation
“adam”
Adam optimizer
“sgd”
stochastic gradient descent optimizer
- lr
learning rate for the optimizer (float between 0 and 1)
- decay
learning rate decay for the optimizer (float between 0 and 1)
- loss
loss functions for training the network
possible options
explanation
“mean_squared_error”
for regression problems
“binary_crossentropy”
for binary classification problems
“categorical_crossentropy”
for multi-class classification problems
- metrics
additional metrics to check the model performance
possible options
explanation
“correlation_coefficient”
Pearson correlation coefficient (-1 ~ 1)
“binary_accuracy”
Accuracy for binary classification problem (0 ~ 1)
“categorical_accuracy”
Accuracy for multi-class classification problem (0 ~ 1)
“sensitivity”
Sensitivity (0 ~ 1)
“specificity”
Specificity (0 ~ 1)
“gmeasure”
(Sensitivity * Specificity) ^ (0.5) (0 ~ 1)
“auc”
Area under the receiver operating characteristics (0 ~ 1)
“precision”
Precision (0 ~ 1)
“recall”
Recall (0 ~ 1)
“f1”
F1 score (0 ~ 1)
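Three of the metrics above combine as follows; this is a plain-Python sketch of the formulas (DeepBiome's own implementations operate on Keras tensors and may differ in detail):

```python
# Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP);
# gmeasure = sqrt(sensitivity * specificity), all on binary labels.
def sensitivity(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

def specificity(y_true, y_pred):
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tn / (tn + fp)

def gmeasure(y_true, y_pred):
    return (sensitivity(y_true, y_pred) * specificity(y_true, y_pred)) ** 0.5

y_true = [1, 1, 0, 0, 1, 0]  # toy labels, not from the dataset above
y_pred = [1, 0, 0, 1, 1, 0]
print(gmeasure(y_true, y_pred))  # sqrt(2/3 * 2/3) = 2/3
```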
- taxa_selection_metrics
metrics for the taxa selection performance
possible options
explanation
“accuracy”
Accuracy (0 ~ 1)
“sensitivity”
Sensitivity (0 ~ 1)
“specificity”
Specificity (0 ~ 1)
“gmeasure”
(Sensitivity * Specificity) ^ (0.5) (0 ~ 1)
- normalizer
normalizer for the input data (default=`normalize_minmax`)
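The default normalizer rescales each input feature to the [0, 1] range. A minimal per-column sketch of min-max normalization (illustrative; DeepBiome's `normalize_minmax` may differ in detail, e.g. in how it handles constant columns):

```python
def normalize_minmax(column):
    """Rescale one feature column linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                 # constant column: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# The Streptococcus counts of the 4 samples in gcount_0001.csv above
print(normalize_minmax([841, 1445, 3638, 3445]))
```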
network_info[‘architecture_info’]¶
Detailed options for the architecture_info field are as follows.
Combinations of the options below give the network training methods DNN, DNN+L1, and DeepBiome from the reference (URL TBD).
- weight_initial
network weight initialization
possible options
explanation
“glorot_uniform”
Glorot uniform initializer (default)
“he_normal”
He normal initializer
“phylogenetic_tree”
weight within the tree connection: 1; weight without the tree connection: 0
“phylogenetic_tree_glorot_uniform”
weight within the tree connection: glorot_uniform; weight without the tree connection: 0
“phylogenetic_tree_he_normal”
weight within the tree connection: he_normal; weight without the tree connection: 0
- weight_l1_penalty
\(\lambda\) for the l1 penalty (float; default = 0)
- weight_l2_penalty
\(\lambda\) for the l2 penalty (float; default = 0)
- weight_decay
DeepBiome with the phylogenetic tree based weight decay method (default = “”: without deepbiome weight decay method)
possible options
explanation
“phylogenetic_tree”
weight decay method based on the phylogenetic tree information with a small amount of noise (\(\epsilon \le 1e-2\))
“phylogenetic_tree_wo_noise”
weight decay method based on the phylogenetic tree information without any noise outside the tree
- batch_normalization
option to add batch normalization for each convolutional layer (default = False)
- drop_out
option to add dropout for each convolutional layer with the given ratio (default = 0)
Hint
Example combinations of the options for the training methods in the reference paper (URL TBD):

training method | combination of the options
---|---
DNN | “weight_initial”=”glorot_uniform”
DNN+L1 | “weight_initial”=”glorot_uniform”, “weight_l1_penalty”=”0.01”
DeepBiome | “weight_initial”=”glorot_uniform”, “weight_decay”=”phylogenetic_tree”
network_info[‘training_info’]¶
Detailed options for the training_info field are as follows.
- epochs
number of the epoch for training (integer)
- batch_size
number of the batch size for each mini-batch (integer)
- callbacks
callback class implemented in Keras (See Callbacks)
possible options
explanation
“ModelCheckpoint”
save the best model weight based on the monitor (See ModelCheckpoint)
“EarlyStopping”
stop the training early, before the full number of epochs, based on the monitor (See EarlyStopping)
- monitor
monitor value for the ModelCheckpoint and EarlyStopping callbacks (e.g. val_loss, val_accuracy)
- mode
how to use the monitor value for the ModelCheckpoint, EarlyStopping callbacks
possible options
explanation
“min”
for example: when using the monitor val_loss
“max”
for example: when using the monitor val_accuracy
- patience
patience for the EarlyStopping callback (integer; default = 20)
- min_delta
the minimum threshold for the ModelCheckpoint, EarlyStopping callbacks (float; default = 1e-4)
network_info[‘validation_info’]¶
Detailed options for the validation_info field are as follows.
- validation_size
the ratio of the number of samples in the validation set to the number of samples in the training set (e.g. “0.2”)
- batch_size
the batch size for each mini-batch. If “None”, the whole sample set is used as one mini-batch. (default = “None”)
network_info[‘test_info’]¶
Detailed options for the test_info field are as follows.
- batch_size
the batch size for each mini-batch. If “None”, the whole sample set is used as one mini-batch. (default = “None”)
Example for the network_info¶
This is an example of the network_info configuration dictionary:
network_info = {
    'architecture_info': {
        'batch_normalization': 'False',
        'drop_out': '0',
        'weight_initial': 'glorot_uniform',
        'weight_l1_penalty': '0.01',
        'weight_decay': 'phylogenetic_tree',
    },
    'model_info': {
        'decay': '0.001',
        'loss': 'binary_crossentropy',
        'lr': '0.01',
        'metrics': 'binary_accuracy, sensitivity, specificity, gmeasure, auc',
        'network_class': 'DeepBiomeNetwork',
        'normalizer': 'normalize_minmax',
        'optimizer': 'adam',
        'reader_class': 'MicroBiomeClassificationReader',
        'taxa_selection_metrics': 'accuracy, sensitivity, specificity, gmeasure'
    },
    'training_info': {
        'batch_size': '200',
        'epochs': '10',
        'callbacks': 'ModelCheckpoint',
        'monitor': 'val_binary_accuracy',
        'mode': 'max',
        'min_delta': '1e-4',
    },
    'validation_info': {
        'batch_size': 'None', 'validation_size': '0.2'
    },
    'test_info': {
        'batch_size': 'None'
    }
}
This is an example of the configuration file network_info.cfg:
[model_info]
network_class = DeepBiomeNetwork
optimizer = adam
lr = 0.01
decay = 0.0001
loss = binary_crossentropy
metrics = binary_accuracy, sensitivity, specificity, gmeasure, auc
taxa_selection_metrics = accuracy, sensitivity, specificity, gmeasure
reader_class = MicroBiomeClassificationReader
normalizer = normalize_minmax
[architecture_info]
weight_initial = glorot_uniform
weight_decay = phylogenetic_tree
batch_normalization = False
drop_out = 0
[training_info]
epochs = 1000
batch_size = 200
callbacks = ModelCheckpoint
monitor = val_binary_accuracy
mode = max
min_delta = 1e-4
[validation_info]
validation_size = 0.2
batch_size = None
[test_info]
batch_size = None
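Either form can be used because a .cfg file parses directly into the same nested-dictionary shape. A sketch with the standard library (the inlined two-section excerpt is illustrative; DeepBiome's own config loader may differ):

```python
import configparser

# Excerpt of network_info.cfg above, inlined for illustration
cfg_text = """
[model_info]
network_class = DeepBiomeNetwork
optimizer = adam
lr = 0.01

[training_info]
epochs = 1000
batch_size = 200
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)

# Same nested shape as the network_info dictionary; values stay strings,
# just as in the dictionary example (e.g. 'lr': '0.01')
network_info = {section: dict(parser[section])
                for section in parser.sections()}
print(network_info)
```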
Hint
See Example for configuration file examples for various problems.
Preparing the configuration about the path information (path_info)¶
To provide the path to the dataset and the paths for saving the trained weights and the evaluation results, we pass a configuration dictionary in the path_info field. Alternatively, we can use a configuration file (.cfg).
Your configuration for the paths should include the information about:
- data_info
about the path information of the dataset
- model_info
about the path information for saving the trained weights and the evaluation results
Note
All paths are relative to the directory where the code runs.
path_info[‘data_info’]¶
To use the list of k input files, we can use the options below:
- tree_info_path
tree information file (.csv)
- count_list_path
list of the names of the input files (.csv)
- count_path
directory path of the input files
- y_path
y path (.csv) (not required for prediction)
- idx_path
index file path for the repetitions (.csv)
- data_path
directory path of the index and y file
To use one input file, we can use the options below:
- tree_info_path
tree information file (.csv)
- x_path
input path (.csv)
- y_path
y path (.csv) (not required for prediction)
- data_path
directory path of the index, x and y file
path_info[‘model_info’]¶
- weight
weight file name (.h5)
- evaluation
evaluation file name (.npy) (not required for prediction)
- model_dir
base directory path for the model (weight, evaluation)
- history
history file name for the history values of each evaluation metric from training (.json). If not set, DeepBiome will not save the history of the network training.
Warning
If you want to use sub-directories in the path (for example, “weight”=”weight/weight.h5”, “history”=”history/hist.json”, “model_dir”=”./”), you must create the sub-directories “./weight” and “./history” before running the code.
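The required sub-directories can also be created programmatically before training; a small sketch using the example paths from the warning above:

```python
import os

model_dir = './'  # example value from the warning above
for sub in ('weight', 'history'):
    # exist_ok avoids an error if the directory already exists
    os.makedirs(os.path.join(model_dir, sub), exist_ok=True)
print(sorted(d for d in ('weight', 'history') if os.path.isdir(d)))
```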
Example for the path_info for the list of inputs¶
This is an example of the path_info configuration dictionary:
path_info = {
    'data_info': {
        'count_list_path': 'data/simulation/gcount_list.csv',
        'count_path': 'data/simulation/count',
        'data_path': 'data/simulation/s2/',
        'idx_path': 'data/simulation/s2/idx.csv',
        'tree_info_path': 'data/genus48/genus48_dic.csv',
        'x_path': '',
        'y_path': 'y.csv'
    },
    'model_info': {
        'model_dir': './simulation_s2/simulation_s2_deepbiome/',
        'weight': 'weight/weight.h5',
        'history': 'hist.json',
        'evaluation': 'eval.npy'
    }
}
This is an example of the configuration file path_info.cfg:
[data_info]
data_path = data/simulation/s2/
tree_info_path = data/genus48/genus48_dic.csv
idx_path = data/simulation/s2/idx.csv
count_list_path = data/simulation/gcount_list.csv
count_path = data/simulation/count
y_path = y.csv
[model_info]
model_dir = ./simulation_s2/simulation_s2_deepbiome/
weight = weight/weight.h5
history = history/hist.json
evaluation = eval.npy
Example for the path_info for the one input file¶
This is an example of the path_info configuration dictionary:
path_info = {
    'data_info': {
        'data_path': '../../data/pulmonary/',
        'tree_info_path': '../../data/genus48/genus48_dic.csv',
        'x_path': 'X.csv',
        'y_path': 'y.csv'
    },
    'model_info': {
        'model_dir': './',
        'weight': 'weight/weight.h5',
        'history': 'history/hist.json',
        'evaluation': 'eval.npy',
    }
}
This is an example of the configuration file path_info.cfg:
[data_info]
data_path = ../../data/pulmonary/
tree_info_path = ../../data/genus48/genus48_dic.csv
x_path = X.csv
y_path = y.csv
[model_info]
model_dir = ./
weight = weight/weight.h5
history = history/hist.json
evaluation = eval.npy
Hint
See Example for configuration file examples for various problems.