Prerequisites

Data preprocessing

The DeepBiome package takes microbiome abundance data as input and uses the phylogenetic taxonomy to guide the choice of the optimal number of layers and neurons in the deep learning architecture.

To use DeepBiome, you can run (1) `k` times repetition or (2) `k` fold cross-validation experiments. For each experiment, we assume that the dataset is given by

  1. a list of `k` input files for `k` times repetition, or

  2. one input file for `k` fold cross-validation.

With a list of k inputs for k times repetition

DeepBiome needs 4 data files as follows:

  1. the tree information

  2. the list of the input files (each file has all samples’ information for one repetition)

  3. the list of the names of input files

  4. the y file (outcomes)

For k times repetition, we can use the list of k input files. Each file has all samples’ information for one repetition. In addition, we can set the training index for each repetition. If we provide the index file, DeepBiome builds the training set for each repetition based on the corresponding fold index in the index file. If not, DeepBiome will generate the index file locally.

Each file should be in the CSV format as follows:

tree information (.csv)

A file with the phylogenetic tree information. Below is an example of the phylogenetic tree information dictionary.

Example of genus dictionary

Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Tropheryma,Cellulomonadaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Flavobacterium,Flavobacteriaceae,Flavobacteriales,Flavobacteria,Bacteroidetes,Bacteria
Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Porphyromonas,Porphyromonadaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Parvimonas,Clostridiales_Incertae_Sedis_XI,Clostridiales,Clostridia,Firmicutes,Bacteria
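A quick way to sanity-check such a dictionary file is to load it with pandas (a sketch under the assumption that pandas is available; the file content is inlined here via io.StringIO instead of read from disk):

```python
import io
import pandas as pd

# Two rows of the genus dictionary above, inlined for the sketch.
tree_csv = io.StringIO(
    "Genus,Family,Order,Class,Phylum,Domain\n"
    "Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria\n"
    "Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria\n"
)
tree = pd.read_csv(tree_csv)

# Each row maps one genus up through the six taxonomic ranks.
print(list(tree.columns))
```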

list of the names of k input files (.csv)

If we want to use the list of the input files, we need to make a list of the names of each input file. Below is an example file for k=4 repetitions.

Example of the list of k input file names

gcount_0001.csv
gcount_0002.csv
gcount_0003.csv
gcount_0004.csv

list of k input files

Each file should have each repetition’s sample microbiome abundance. Below is an example file for k=4 repetitions. This example is gcount_0001.csv, the first repetition in the list of input file names above. This file has 4 samples’ microbiome abundance.

Example of one input file (.csv) of the k inputs

Streptococcus,Tropheryma,Veillonella,Actinomyces,Flavobacterium,Prevotella,Porphyromonas,Parvimonas,Fusobacterium,Propionibacterium,Gemella,Rothia,Granulicatella,Neisseria,Lactobacillus,Megasphaera,Catonella,Atopobium,Campylobacter,Capnocytophaga,Solobacterium,Moryella,TM7_genera_incertae_sedis,Staphylococcus,Filifactor,Oribacterium,Burkholderia,Sneathia,Treponema,Moraxella,Haemophilus,Selenomonas,Corynebacterium,Rhizobium,Bradyrhizobium,Methylobacterium,OD1_genera_incertae_sedis,Finegoldia,Microbacterium,Sphingomonas,Chryseobacterium,Bacteroides,Bdellovibrio,Streptophyta,Lachnospiracea_incertae_sedis,Paracoccus,Fastidiosipila,Pseudonocardia
841,0,813,505,5,3224,0,362,11,65,156,1,55,0,1,20,382,1,333,24,80,43,309,2,3,4,0,1,32,0,2,4,382,0,0,96,23,0,0,87,0,0,0,0,0,0,0,2133
1445,0,1,573,0,1278,82,85,69,154,436,3,0,61,440,0,394,83,33,123,0,49,414,0,0,37,0,0,42,0,0,384,27,0,0,0,146,0,0,1,2,0,0,0,0,0,0,3638
1259,0,805,650,0,1088,0,0,74,0,155,228,430,765,0,0,11,102,68,90,77,83,322,10,0,7,0,122,76,0,1,25,0,0,0,44,13,0,0,2,8,1,39,0,0,0,0,3445
982,0,327,594,0,960,81,19,9,0,45,457,1049,0,3,450,19,170,388,147,0,0,41,63,0,1,0,0,121,0,0,1,0,0,0,0,344,0,157,1,0,4,60,0,0,0,0,3507

y (.csv)

One column contains the y samples for one repetition. Below is an example file for k=4 repetitions; each column has the outputs of the 4 samples for one repetition.

Example of y file (.csv)

V1,V2,V3,V4
1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0
0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0

index for training set for each repetition (.csv)

For each repetition, we have to set the training and test sets. If the index file is given, DeepBiome sets the training and test sets based on it. Each column has the training indices for one repetition; DeepBiome will use only the samples in this index set for training. Below is an example of the index file for k=4 repetitions.

Example of index file (.csv)

V1,V2,V3,V4
0,1,2,3
1,2,3,0
2,3,0,1

In the example above, we used the first 3 rows of the first column in y.csv for the training set in the first repetition.
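If no index file is supplied, DeepBiome generates one itself. As an illustration (not DeepBiome's own code), the sketch below builds a compatible idx.csv by hand, assuming the 4-sample, 3-training-index layout of the example above:

```python
import numpy as np
import pandas as pd

n_samples, n_repetitions, n_train = 4, 4, 3
rng = np.random.default_rng(0)

# One column of training indices per repetition, matching the idx.csv layout above.
idx = pd.DataFrame({
    f"V{r + 1}": np.sort(rng.choice(n_samples, size=n_train, replace=False))
    for r in range(n_repetitions)
})
idx.to_csv("idx.csv", index=False)
```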

With one input file for k fold cross-validation

DeepBiome needs 3 data files as follows:

  1. the tree information

  2. the input file

  3. y

For k fold cross-validation, we use one input file. In addition, we can set the training index for each fold. If we provide the index file, DeepBiome builds the training set for each fold based on the corresponding fold index in the index file. If not, DeepBiome will generate the index file locally.

Each file should be in the CSV format as follows:

tree information (.csv)

A file with the phylogenetic tree information. Below is an example of the phylogenetic tree information dictionary.

Example of genus dictionary

Genus,Family,Order,Class,Phylum,Domain
Streptococcus,Streptococcaceae,Lactobacillales,Bacilli,Firmicutes,Bacteria
Tropheryma,Cellulomonadaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Veillonella,Veillonellaceae,Selenomonadales,Negativicutes,Firmicutes,Bacteria
Actinomyces,Actinomycetaceae,Actinomycetales,Actinobacteria,Actinobacteria,Bacteria
Flavobacterium,Flavobacteriaceae,Flavobacteriales,Flavobacteria,Bacteroidetes,Bacteria
Prevotella,Prevotellaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Porphyromonas,Porphyromonadaceae,Bacteroidales,Bacteroidia,Bacteroidetes,Bacteria
Parvimonas,Clostridiales_Incertae_Sedis_XI,Clostridiales,Clostridia,Firmicutes,Bacteria

input file

The input file has the microbiome abundance of each sample. Below is an example file with 4 samples’ microbiome abundance.

Example of input file (.csv)

Streptococcus,Tropheryma,Veillonella,Actinomyces,Flavobacterium,Prevotella,Porphyromonas,Parvimonas,Fusobacterium,Propionibacterium,Gemella,Rothia,Granulicatella,Neisseria,Lactobacillus,Megasphaera,Catonella,Atopobium,Campylobacter,Capnocytophaga,Solobacterium,Moryella,TM7_genera_incertae_sedis,Staphylococcus,Filifactor,Oribacterium,Burkholderia,Sneathia,Treponema,Moraxella,Haemophilus,Selenomonas,Corynebacterium,Rhizobium,Bradyrhizobium,Methylobacterium,OD1_genera_incertae_sedis,Microbacterium,Sphingomonas,Chryseobacterium,Bdellovibrio,Streptophyta,Finegoldia,Bacteroides,Lachnospiracea_incertae_sedis,Paracoccus,Fastidiosipila,Pseudonocardia
6244.0,0.0,2985.0,204.0,5.0,3548.0,308.0,53.0,506.0,3.0,324.0,669.0,795.0,0.0,3686.0,6.0,41.0,609.0,11.0,0.0,3.0,0.0,0.0,31.0,0.0,0.0,3.0,0.0,93.0,0.0,1.0,9.0,0.0,0.0,0.0,3.0,2.0,5.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
3573.0,3.0,2566.0,975.0,0.0,7195.0,377.0,39.0,442.0,69.0,602.0,45.0,527.0,4536.0,324.0,105.0,626.0,400.0,1130.0,1132.0,127.0,9.0,426.0,16.0,0.0,150.0,1.0,244.0,248.0,0.0,19.0,26.0,9.0,0.0,0.0,14.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
516.0,0.0,227.0,39.0,0.0,180.0,54.0,38.0,80.0,2.0,84.0,26.0,18.0,0.0,2235.0,4.0,3.0,3.0,8.0,2.0,10.0,2.0,7.0,0.0,0.0,0.0,0.0,78.0,7.0,0.0,4.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
881.0,1.0,47.0,4.0,0.0,13.0,5.0,1.0,108.0,1.0,7.0,6.0,6.0,117.0,0.0,0.0,2.0,0.0,4.0,6.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.0,3.0,6.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0

y (.csv)

Below is an example file of the outputs of 4 samples.

Example of y file (.csv)

y
0.78
1.01
0.92
0.91

index for training set for each fold (.csv)

For each fold, we have to set the training and test sets. If the index file is given, DeepBiome sets the training and test sets based on it. Each column has the training indices for one fold; DeepBiome will use only the samples in this index set for training. Below is an example of the index file for k=4 folds.

Example of index file (.csv)

V1,V2,V3,V4
0,1,2,3
1,2,3,0
2,3,0,1

In the example above, we used the first 3 rows of the first column in y.csv for the training set in the first fold.
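If no index file is supplied, DeepBiome generates one itself. As an illustration (not DeepBiome's own code), a standard k-fold split can be turned into the same V1..V4 layout, assuming 4 samples and k=4 folds as in the example above:

```python
import numpy as np
import pandas as pd

n_samples, k = 4, 4
rng = np.random.default_rng(0)

# Shuffle the sample indices and split them into k disjoint held-out folds.
folds = np.array_split(rng.permutation(n_samples), k)

# Column V{i} holds the training indices for fold i: everything outside fold i.
idx = pd.DataFrame({
    f"V{i + 1}": np.concatenate([f for j, f in enumerate(folds) if j != i])
    for i in range(k)
})
```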

Configuration

For detailed configuration, we use a Python dictionary as input to the main training function.

Preparing the configuration about the network information (network_info)

To give the information about the training hyper-parameters, we provide a dictionary of configurations to the network_info field. Alternatively, we can use a configuration file (.cfg).

Configuration for the network training should include the information about:

model_info

about the training method and metrics

architecture_info

about the architecture options

training_info

about the hyper-parameters for training (not required for testing and prediction)

validation_info

about the hyper-parameters for validation (not required for testing and prediction)

test_info

about the hyper-parameters for testing

Note

You do not have to set an option if it has a default value.

network_info[‘model_info’]

Detailed options for the model_info field are as follows.

network_class

DeepBiome network class (default=’DeepBiomeNetwork’).

reader_class

reader classes

possible options:

  “MicroBiomeRegressionReader”: microbiome abundance data reader for regression problems
  “MicroBiomeClassificationReader”: microbiome abundance data reader for classification problems

optimizer

optimization method for training the network. We use the optimizers implemented in Keras (see Optimizer).

possible options:

  “adam”: Adam optimizer
  “sgd”: stochastic gradient descent optimizer

lr

learning rate for the optimizer (float between 0 and 1)

decay

learning rate decay ratio for the optimizer (float between 0 and 1)

loss

loss function for training the network

possible options:

  “mean_squared_error”: for regression problems
  “binary_crossentropy”: for binary classification problems
  “categorical_crossentropy”: for multi-class classification problems

metrics

additional metrics to check the model performance

possible options:

  “correlation_coefficient”: Pearson correlation coefficient (-1 ~ 1)
  “binary_accuracy”: accuracy for binary classification problems (0 ~ 1)
  “categorical_accuracy”: accuracy for multi-class classification problems (0 ~ 1)
  “sensitivity”: sensitivity (0 ~ 1)
  “specificity”: specificity (0 ~ 1)
  “gmeasure”: (sensitivity * specificity) ^ 0.5 (0 ~ 1)
  “auc”: area under the receiver operating characteristic curve (0 ~ 1)
  “precision”: precision (0 ~ 1)
  “recall”: recall (0 ~ 1)
  “f1”: F1 score (0 ~ 1)
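For instance, the gmeasure metric above is the geometric mean of sensitivity and specificity (the numbers here are made up for illustration):

```python
import math

sensitivity, specificity = 0.90, 0.64  # hypothetical values
gmeasure = math.sqrt(sensitivity * specificity)  # (0.90 * 0.64) ** 0.5
print(round(gmeasure, 3))  # 0.759
```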

taxa_selection_metrics

metrics for the taxa selection performance

possible options:

  “accuracy”: accuracy (0 ~ 1)
  “sensitivity”: sensitivity (0 ~ 1)
  “specificity”: specificity (0 ~ 1)
  “gmeasure”: (sensitivity * specificity) ^ 0.5 (0 ~ 1)

normalizer

normalizer for the input data (default=`normalize_minmax`)

network_info[‘architecture_info’]

Detailed options for the architecture_info field are as follows.

Combinations of the options below give the network training methods DNN, DNN+L1, and DeepBiome in the reference (url TBD).

weight_initial

network weight initialization

possible options:

  “glorot_uniform”: Glorot uniform initializer (default)
  “he_normal”: He normal initializer
  “phylogenetic_tree”: weights within the tree connections: 1; weights outside the tree connections: 0
  “phylogenetic_tree_glorot_uniform”: weights within the tree connections: glorot_uniform; weights outside the tree connections: 0
  “phylogenetic_tree_he_normal”: weights within the tree connections: he_normal; weights outside the tree connections: 0

weight_l1_penalty

\(\lambda\) for the l1 penalty (float; default = 0)

weight_l2_penalty

\(\lambda\) for the l2 penalty (float; default = 0)

weight_decay

DeepBiome weight decay method based on the phylogenetic tree (default = “”: no DeepBiome weight decay)

possible options:

  “phylogenetic_tree”: weight decay based on the phylogenetic tree information, with a small amount of noise (\(\epsilon \le 1e-2\)) outside the tree
  “phylogenetic_tree_wo_noise”: weight decay based on the phylogenetic tree information, without any noise outside the tree

batch_normalization

option for adding batch normalization to each convolutional layer (default = False)

drop_out

option for adding dropout to each convolutional layer with the given ratio (default = 0)

Hint

Examples of the combinations of options used in the reference paper (url TBD):

  DNN: “weight_initial”=”glorot_uniform”
  DNN+L1: “weight_initial”=”glorot_uniform”, “weight_l1_penalty”=”0.01”
  DeepBiome: “weight_initial”=”glorot_uniform”, “weight_decay”=”phylogenetic_tree”

network_info[‘training_info’]

Detailed options for the training_info field are as follows.

epochs

number of epochs for training (integer)

batch_size

batch size for each mini-batch (integer)

callbacks

callback classes implemented in Keras (see Callbacks)

possible options:

  “ModelCheckpoint”: save the best model weights based on the monitor (see ModelCheckpoint)
  “EarlyStopping”: stop training before the given number of epochs based on the monitor (see EarlyStopping)

monitor

monitored value for the ModelCheckpoint and EarlyStopping callbacks (e.g. val_loss, val_accuracy)

mode

how to use the monitored value for the ModelCheckpoint and EarlyStopping callbacks

possible options:

  “min”: for example, when using the monitor val_loss
  “max”: for example, when using the monitor val_accuracy

patience

patience for the EarlyStopping callback (integer; default = 20)

min_delta

the minimum threshold for the ModelCheckpoint, EarlyStopping callbacks (float; default = 1e-4)

network_info[‘validation_info’]

Detailed options for the validation_info field are as follows.

validation_size

the ratio of the number of samples in the validation set to the number of samples in the training set (e.g. “0.2”)

batch_size

the batch size for each mini-batch. If “None”, all samples are used as one mini-batch. (default = “None”)

network_info[‘test_info’]

Detailed options for the test_info field are as follows.

batch_size

the batch size for each mini-batch. If “None”, all samples are used as one mini-batch. (default = “None”)

Example for the network_info

This is an example of the network_info configuration dictionary:

network_info = {
    'architecture_info': {
        'batch_normalization': 'False',
        'drop_out': '0',
        'weight_initial': 'glorot_uniform',
        'weight_l1_penalty':'0.01',
        'weight_decay': 'phylogenetic_tree',
    },
    'model_info': {
        'decay': '0.001',
        'loss': 'binary_crossentropy',
        'lr': '0.01',
        'metrics': 'binary_accuracy, sensitivity, specificity, gmeasure, auc',
        'network_class': 'DeepBiomeNetwork',
        'normalizer': 'normalize_minmax',
        'optimizer': 'adam',
        'reader_class': 'MicroBiomeClassificationReader',
        'taxa_selection_metrics': 'accuracy, sensitivity, specificity, gmeasure'
    },
    'training_info': {
        'batch_size': '200',
        'epochs': '10',
        'callbacks': 'ModelCheckpoint',
        'monitor': 'val_binary_accuracy',
        'mode': 'max',
        'min_delta': '1e-4',
    },
    'validation_info': {
        'batch_size': 'None', 'validation_size': '0.2'
    },
    'test_info': {
        'batch_size': 'None'
    }
}

This is an example of the configuration file network_info.cfg:

[model_info]
network_class = DeepBiomeNetwork
optimizer   = adam
lr          = 0.01
decay       = 0.0001
loss        = binary_crossentropy
metrics     = binary_accuracy, sensitivity, specificity, gmeasure, auc
taxa_selection_metrics = accuracy, sensitivity, specificity, gmeasure
reader_class = MicroBiomeClassificationReader
normalizer  = normalize_minmax

[architecture_info]
weight_initial = glorot_uniform
weight_decay = phylogenetic_tree
batch_normalization = False
drop_out = 0

[training_info]
epochs          = 1000
batch_size      = 200
callbacks       = ModelCheckpoint
monitor         = val_binary_accuracy
mode            = max
min_delta       = 1e-4

[validation_info]
validation_size = 0.2
batch_size = None

[test_info]
batch_size = None
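The .cfg layout maps directly onto the nested dictionary form. A minimal sketch of that mapping using Python's standard configparser (the file content is inlined here for illustration; DeepBiome's own config loader may differ):

```python
import configparser

# A few settings in the same [section] layout as network_info.cfg above.
cfg_text = """
[model_info]
network_class = DeepBiomeNetwork
optimizer = adam
lr = 0.01

[training_info]
epochs = 1000
batch_size = 200
"""

parser = configparser.ConfigParser()
parser.read_string(cfg_text)

# Convert to the nested-dictionary form; note all values stay strings.
network_info = {section: dict(parser[section]) for section in parser.sections()}
print(network_info["model_info"]["lr"])  # '0.01'
```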

Hint

See the Examples section for configuration file examples for various problems.

Preparing the configuration about the path information (path_info)

To give the information about the paths to the dataset and the paths for saving the trained weights and the evaluation results, we provide a dictionary of configurations to the path_info field. Alternatively, we can use a configuration file (.cfg).

Your configuration for the paths should include the information about:

data_info

about the path information of the dataset

model_info

about the path information for saving the trained weights and the evaluation results

Note

All paths are relative to the directory where the code runs.

path_info[‘data_info’]

To provide the dictionary as input, we can use the options below:

tree_info_path

tree information file (.csv)

count_list_path

list of the names of the input files (.csv)

count_path

directory path of the input files

y_path

y path (.csv) (not required for prediction)

idx_path

index file path for the repetitions (.csv)

data_path

directory path of the index and y file

To use one input file, we can use the options below:

tree_info_path

tree information file (.csv)

x_path

input path (.csv)

y_path

y path (.csv) (not required for prediction)

data_path

directory path of the index, x and y file

path_info[‘model_info’]

weight

weight file name (.h5)

evaluation

evaluation file name (.npy) (not required for prediction)

model_dir

base directory path for the model (weight, evaluation)

history

history file name for the history values of each evaluation metric from the training (.json). If not set, DeepBiome will not save the training history.

Warning

If you want to use sub-directories in the paths (for example, “weight”=”weight/weight.h5”, “history”=”history/hist.json”, “model_dir”=”./”), you have to create the sub-directories “./weight” and “./history” before running the code.
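Those sub-directories can be created up front with the standard library (a sketch following the warning's example layout):

```python
import os

model_dir = "./"  # base directory from the warning's example
for sub in ("weight", "history"):
    # exist_ok=True makes this safe to run repeatedly.
    os.makedirs(os.path.join(model_dir, sub), exist_ok=True)
```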

Example for the path_info for the list of inputs

This is an example of the path_info configuration dictionary:

path_info = {
    'data_info': {
        'count_list_path': 'data/simulation/gcount_list.csv',
        'count_path': 'data/simulation/count',
        'data_path': 'data/simulation/s2/',
        'idx_path': 'data/simulation/s2/idx.csv',
        'tree_info_path': 'data/genus48/genus48_dic.csv',
        'x_path': '',
        'y_path': 'y.csv'
    },
    'model_info': {
        'model_dir': './simulation_s2/simulation_s2_deepbiome/',
        'weight': 'weight/weight.h5',
        'history': 'hist.json',
        'evaluation': 'eval.npy'
    }
}

This is an example of the configuration file path_info.cfg:

[data_info]
data_path = data/simulation/s2/
tree_info_path = data/genus48/genus48_dic.csv
idx_path = data/simulation/s2/idx.csv
count_list_path = data/simulation/gcount_list.csv
count_path = data/simulation/count
y_path = y.csv

[model_info]
model_dir = ./simulation_s2/simulation_s2_deepbiome/
weight = weight/weight.h5
history = history/hist.json
evaluation = eval.npy

Example for the path_info for the one input file

This is an example of the path_info configuration dictionary:

path_info = {
    'data_info': {
        'data_path': '../../data/pulmonary/',
        'tree_info_path': '../../data/genus48/genus48_dic.csv',
        'x_path': 'X.csv',
        'y_path': 'y.csv'
    },
    'model_info': {
        'model_dir': './',
        'weight': 'weight/weight.h5',
        'history':'history/hist.json',
        'evaluation': 'eval.npy',
    }
}

This is an example of the configuration file path_info.cfg:

[data_info]
data_path = ../../data/pulmonary/
tree_info_path = ../../data/genus48/genus48_dic.csv
x_path = X.csv
y_path = y.csv

[model_info]
model_dir = ./
weight = weight/weight.h5
history = history/hist.json
evaluation = eval.npy

Hint

See the Examples section for configuration file examples for various problems.