# Keras QuickRef

Keras is a high-level neural networks API, written in Python, that runs on top of the deep learning framework TensorFlow. In fact, **tf.keras** will be integrated directly into TensorFlow 1.2!

Here are my API notes:

## Model API

```
summary()
get_config()
from_config(config)
get_weights()
set_weights(weights)
to_json()
to_yaml()
save_weights(filepath)
load_weights(filepath, by_name)
layers
```

## Sequential / Functional Model APIs

```
add(layer)
compile(optimizer, loss, metrics, sample_weight_mode)
fit(x, y, batch_size, nb_epoch, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight)
evaluate(x, y, batch_size, verbose, sample_weight)
predict(x, batch_size, verbose)
predict_classes(x, batch_size, verbose)
predict_proba(x, batch_size, verbose)
train_on_batch(x, y, class_weight, sample_weight)
test_on_batch(x, y, sample_weight)
predict_on_batch(x)
fit_generator(generator, samples_per_epoch, nb_epoch, verbose, callbacks, validation_data, nb_val_samples, class_weight, max_q_size, nb_worker, pickle_safe)
evaluate_generator(generator, val_samples, max_q_size, nb_worker, pickle_safe)
predict_generator(generator, val_samples, max_q_size, nb_worker, pickle_safe)
get_layer(name, index)
```

## Layers

### Core

Layer | description | IO | params |
---|---|---|---|
Dense | vanilla fully connected NN layer | (nb_samples, input_dim) --> (nb_samples, output_dim) | `output_dim/shape, init, activation, weights, W_regularizer, b_regularizer, activity_regularizer, W_constraint, b_constraint, bias, input_dim/shape` |
Activation | applies an activation function to an output | TN --> TN | `activation` |
Dropout | randomly sets a fraction p of input units to 0 at each update during training --> reduces overfitting | TN --> TN | `p` |
SpatialDropout2D/3D | drops entire 2D/3D feature maps to counter pixel / voxel proximity correlation | (samples, rows, cols, [stacks,] channels) --> (samples, rows, cols, [stacks,] channels) | `p` |
Flatten | flattens the input to 1D | (nb_samples, D1, D2, D3) --> (nb_samples, D1xD2xD3) | - |
Reshape | reshapes an output to a different factorization | eg (None, 3, 4) --> (None, 12) or (None, 2, 6) | `target_shape` |
Permute | permutes the dimensions of the input - output shape is the input shape with the dimensions re-ordered | eg (None, A, B) --> (None, B, A) | `dims` |
RepeatVector | repeats the input n times | (nb_samples, features) --> (nb_samples, n, features) | `n` |
Merge | merges a list of tensors into a single tensor | [TN] --> TN | `layers, mode, concat_axis, dot_axes, output_shape, output_mask, node_indices, tensor_indices, name` |
Lambda | wraps an arbitrary TensorFlow expression as a layer | flexible | `function, output_shape, arguments` |
ActivityRegularization | adds an L1 / L2 penalty on the input activity to the cost function | TN --> TN | `l1, l2` |
Masking | marks timesteps (dimension 1) equal to `mask_value` so that downstream layers skip them | TN --> TN | `mask_value` |
Highway | densely connected highway layer - LSTM-style gating applied to a feed-forward network | (nb_samples, input_dim) --> (nb_samples, output_dim) | same as Dense + `transform_bias` |
MaxoutDense | takes the element-wise maximum of `nb_feature` linear Dense layers - learns a convex, piecewise linear activation function over the inputs | (nb_samples, input_dim) --> (nb_samples, output_dim) | same as Dense + `nb_feature` |
TimeDistributed | wrapper that applies a layer (eg Dense) to every timestep along D1, the time dimension | (nb_sample, time_dimension, input_dim) --> (nb_sample, time_dimension, output_dim) | `layer` (eg `Dense`) |
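
To make the IO column concrete, here is a small pure-Python sketch of the shape arithmetic for Flatten, Reshape, Permute and RepeatVector. It is not Keras code: the function names are my own, and it only manipulates shape tuples (with `None` standing for the batch axis), assuming `dims` in Permute is 1-indexed over the non-batch axes.

```python
from functools import reduce
from operator import mul


def flatten_shape(shape):
    """(nb_samples, D1, D2, ...) -> (nb_samples, D1*D2*...)."""
    return (shape[0], reduce(mul, shape[1:], 1))


def reshape_shape(shape, target_shape):
    """Keep the batch axis; the number of elements per sample must not change."""
    if reduce(mul, shape[1:], 1) != reduce(mul, target_shape, 1):
        raise ValueError("total size of the new shape must be unchanged")
    return (shape[0],) + tuple(target_shape)


def permute_shape(shape, dims):
    """dims is 1-indexed over the non-batch axes, eg (2, 1) swaps them."""
    return (shape[0],) + tuple(shape[d] for d in dims)


def repeat_vector_shape(shape, n):
    """(nb_samples, features) -> (nb_samples, n, features)."""
    return (shape[0], n, shape[1])


print(flatten_shape((None, 3, 4)))          # (None, 12)
print(reshape_shape((None, 3, 4), (2, 6)))  # (None, 2, 6)
print(permute_shape((None, 5, 7), (2, 1)))  # (None, 7, 5)
print(repeat_vector_shape((None, 32), 3))   # (None, 3, 32)
```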

### Convolutional

Layer | description | IO | params |
---|---|---|---|
Convolution1D | filter neighborhoods of 1D inputs | (samples, steps, input_dim) --> (samples, new_steps, nb_filter) | `nb_filter, filter_length, init, activation, weights, border_mode, subsample_length, W_regularizer, b_regularizer, activity_regularizer, W_constraint, b_constraint, bias, input_dim, input_length` |
Convolution2D | filter neighborhoods of 2D inputs | (samples, rows, cols, channels) --> (samples, new_rows, new_cols, nb_filter) | like Convolution1D, with `nb_row, nb_col` instead of `filter_length`, + `subsample, dim_ordering` |
AtrousConvolution1D/2D | dilated convolution (convolution with holes) | same as Convolution1D/2D | same as Convolution1D/2D + `atrous_rate` |
SeparableConvolution2D | first performs a depthwise spatial convolution on each input channel separately, then a pointwise convolution that mixes the resulting output channels | same as Convolution2D | same as Convolution2D + `depth_multiplier, depthwise_regularizer, pointwise_regularizer, depthwise_constraint, pointwise_constraint` |
Deconvolution2D | transposed convolution | | |
Convolution3D | filter neighborhoods of 3D inputs | (samples, conv_dim1, conv_dim2, conv_dim3, channels) --> (samples, new_conv_dim1, new_conv_dim2, new_conv_dim3, nb_filter) | `kernel_dim1, kernel_dim2, kernel_dim3` |
Cropping1D/2D/3D | crops along the dimension(s) | (samples, depth, [axes_to_crop]) --> (samples, depth, [cropped_axes]) | `cropping, dim_ordering` |
UpSampling1D/2D/3D | repeats each step x times along the specified axes | (samples, [dims], channels) --> (samples, [upsampled_dims], channels) | `size, dim_ordering` |
ZeroPadding1D/2D/3D | zero padding | (samples, [dims], channels) --> (samples, [padded_dims], channels) | `padding, dim_ordering` |
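
The `new_steps` / `new_rows` / `new_cols` values in the IO column depend on the filter size, `border_mode` and the subsample (stride). A minimal sketch of that arithmetic, assuming only the `"valid"` and `"same"` border modes and no dilation (the helper name is my own, not a public Keras function):

```python
def conv_output_length(input_length, filter_size, border_mode, stride=1):
    """Length of one convolved axis for a given border mode and stride."""
    if border_mode not in ("valid", "same"):
        raise ValueError("border_mode must be 'valid' or 'same'")
    if border_mode == "same":
        # Input is padded so that, at stride 1, the length is preserved.
        length = input_length
    else:
        # 'valid': the filter must fit entirely inside the input.
        length = input_length - filter_size + 1
    # Ceiling division by the stride.
    return (length + stride - 1) // stride


print(conv_output_length(10, 3, "valid"))            # 8
print(conv_output_length(10, 3, "same"))             # 10
print(conv_output_length(10, 3, "valid", stride=2))  # 4
```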

### Pooling & Locally Connected

Layer | description | IO | params |
---|---|---|---|
Max/AveragePooling1D/2D/3D | downscale to max / average | (samples, [len_pool_dimN], channels) --> (samples, [pooled_dimN], channels) | `pool_size, strides, border_mode, dim_ordering` |
GlobalMax/GlobalAveragePooling1D/2D | reduce each channel to its max / average over all pooled dimensions | (samples, [len_pool_dimN], channels) --> (samples, channels) | `dim_ordering` |
LocallyConnected1D/2D | like ConvolutionxD but weights are unshared - a different filter is applied at each patch | | like ConvolutionxD + `subsample` |

### Recurrent

Layer | description | IO | params |
---|---|---|---|
Recurrent | abstract base class | (nb_samples, timesteps, input_dim) --> (nb_samples, timesteps, output_dim) if `return_sequences` else (nb_samples, output_dim) | `weights, return_sequences, go_backwards, stateful, unroll, consume_less, input_dim, input_length` |
SimpleRNN | fully-connected RNN where the output is fed back as input | like Recurrent | Recurrent + `output_dim, init, inner_init, activation, W_regularizer, U_regularizer, b_regularizer, dropout_W, dropout_U` |
GRU | Gated Recurrent Unit | like Recurrent | like SimpleRNN |
LSTM | Long Short-Term Memory unit | like Recurrent | like SimpleRNN |

### Misc

Layer | description | IO | params |
---|---|---|---|
Embedding | turns positive integers (indexes) into dense vectors of fixed size | (nb_samples, sequence_length) --> (nb_samples, sequence_length, output_dim) | `input_dim, output_dim, init, input_length, W_regularizer, activity_regularizer, W_constraint, mask_zero, weights, dropout` |
BatchNormalization | at each batch, normalizes the activations of the previous layer (mean: 0, sd: 1) | TN --> TN | `epsilon, mode, axis, momentum, weights, beta_init, gamma_init, gamma_regularizer, beta_regularizer` |

### Activation

Layer | description | IO | params |
---|---|---|---|
LeakyReLU | ReLU that allows a small gradient when the unit is inactive: `f(x) = alpha * x for x < 0, f(x) = x for x >= 0` | TN --> TN | `alpha` |
PReLU | Parametric ReLU - the slope for negative inputs is a learned array: `f(x) = alphas * x for x < 0, f(x) = x for x >= 0` | TN --> TN | `init, weights` |
ELU | Exponential Linear Unit: `f(x) = alpha * (exp(x) - 1.) for x < 0, f(x) = x for x >= 0` | TN --> TN | `alpha` |
ParametricSoftplus | `alpha * log(1 + exp(beta * x))` | TN --> TN | `alpha, beta` |
ThresholdedReLU | `f(x) = x for x > theta, f(x) = 0 otherwise` | TN --> TN | `theta` |
SReLU | S-shaped ReLU | TN --> TN | `t_left_init, a_left_init, t_right_init, a_right_init` |
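
The formulas in this table are easy to check by hand. Below is a scalar, pure-Python sketch of four of them; the function names and the default parameter values are my own choices for illustration, not necessarily the Keras defaults.

```python
import math


def leaky_relu(x, alpha=0.3):
    """f(x) = alpha * x for x < 0, f(x) = x for x >= 0."""
    return x if x >= 0 else alpha * x


def elu(x, alpha=1.0):
    """f(x) = alpha * (exp(x) - 1) for x < 0, f(x) = x for x >= 0."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def thresholded_relu(x, theta=1.0):
    """f(x) = x for x > theta, f(x) = 0 otherwise."""
    return x if x > theta else 0.0


def parametric_softplus(x, alpha=0.2, beta=5.0):
    """alpha * log(1 + exp(beta * x))."""
    return alpha * math.log(1.0 + math.exp(beta * x))


for x in (-2.0, 0.0, 0.5, 2.0):
    print(x, leaky_relu(x), elu(x), thresholded_relu(x), parametric_softplus(x))
```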

### Noise

Layer | description | IO | params |
---|---|---|---|
GaussianNoise | mitigates overfitting by smoothing: adds 0-centered Gaussian noise with standard deviation sigma | TN --> TN | `sigma` |
GaussianDropout | mitigates overfitting by smoothing: multiplies by 1-centered Gaussian noise with standard deviation `sqrt(p/(1-p))` | TN --> TN | `p` |
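
A rough sketch of what the two noise layers do to a list of activations at training time. It assumes GaussianNoise is additive and GaussianDropout is multiplicative noise centered on 1, which is my understanding of the layers; the helper names are hypothetical and only the standard library is used.

```python
import math
import random


def gaussian_noise(xs, sigma, rng):
    """Add 0-centered Gaussian noise with standard deviation sigma."""
    return [x + rng.gauss(0.0, sigma) for x in xs]


def gaussian_dropout_stddev(p):
    """Standard deviation used for a drop probability p: sqrt(p / (1 - p))."""
    return math.sqrt(p / (1.0 - p))


def gaussian_dropout(xs, p, rng):
    """Multiply by Gaussian noise centered on 1 (assumption, see above)."""
    std = gaussian_dropout_stddev(p)
    return [x * rng.gauss(1.0, std) for x in xs]


rng = random.Random(0)  # seeded only to make the run repeatable
activations = [0.5, 1.0, 1.5]
print(gaussian_noise(activations, sigma=0.1, rng=rng))
print(gaussian_dropout(activations, p=0.2, rng=rng))
```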

# Preprocessing

type | name | transform | params |
---|---|---|---|
sequence | pad_sequences | list of `nb_samples` sequences of scalars --> 2D array of shape `(nb_samples, nb_timesteps)` | `sequences, maxlen, dtype` |
sequence | skipgrams | list of word indexes (int) --> list of (word, word) couples | `sequence, vocabulary_size, window_size, negative_samples, shuffle, categorical, sampling_table` |
sequence | make_sampling_table | generates the sampling table, an array of shape (size,), used by skipgrams | `size, sampling_factor` |
text | text_to_word_sequence | sentence --> list of words | `text, filters, lower, split` |
text | one_hot | text --> list of word indexes in a vocabulary of size n | `text, n, filters, lower, split` |
text | Tokenizer | text --> list of word indexes | `nb_words, filters, lower, split` |
image | ImageDataGenerator | batches of image tensors | `featurewise_center, samplewise_center, featurewise_std_normalization, samplewise_std_normalization, zca_whitening, rotation_range, width_shift_range, height_shift_range, shear_range, zoom_range, channel_shift_range, fill_mode, cval, horizontal_flip, vertical_flip, rescale, dim_ordering` |
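
To illustrate the `pad_sequences` transform, here is a pure-Python sketch (not the Keras implementation, and returning nested lists rather than an array). It assumes padding and truncation both happen at the front of each sequence, which I believe matches the defaults; the function name and the `value` argument are my own.

```python
def pad_sequences_sketch(sequences, maxlen=None, value=0):
    """Pad / truncate every sequence to maxlen, working from the front."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # Keep the last maxlen items (truncate at the front).
            seq = seq[len(seq) - maxlen:]
        # Fill the missing positions at the front with the pad value.
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


print(pad_sequences_sketch([[1, 2, 3], [4, 5], [6]]))
# [[1, 2, 3], [0, 4, 5], [0, 0, 6]]
print(pad_sequences_sketch([[1, 2, 3], [4, 5], [6]], maxlen=2))
# [[2, 3], [4, 5], [0, 6]]
```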

# Objectives (Loss Functions)

- mean_squared_error / mse
- mean_absolute_error / mae
- mean_absolute_percentage_error / mape
- mean_squared_logarithmic_error / msle
- squared_hinge
- hinge
- binary_crossentropy (logloss)
- categorical_crossentropy (multiclass logloss) - requires labels to be binary arrays of shape `(nb_samples, nb_classes)`
- sparse_categorical_crossentropy - as above but accepts sparse labels
- kullback_leibler_divergence / kld - information gain from a predicted probability distribution Q to a true probability distribution P
- poisson - mean of `(predictions - targets * log(predictions))`
- cosine_proximity - negative mean cosine proximity between predictions and targets
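
A few of these objectives written out in plain Python over lists of floats, as a sketch of the formulas rather than the Keras implementations. The `eps` clipping in `binary_crossentropy` is my own addition to avoid `log(0)`; hinge targets are assumed to be -1 / +1.

```python
import math


def _mean(values):
    values = list(values)
    return sum(values) / len(values)


def mean_squared_error(y_true, y_pred):
    return _mean((p - t) ** 2 for t, p in zip(y_true, y_pred))


def mean_absolute_error(y_true, y_pred):
    return _mean(abs(p - t) for t, p in zip(y_true, y_pred))


def hinge(y_true, y_pred):
    return _mean(max(1.0 - t * p, 0.0) for t, p in zip(y_true, y_pred))


def squared_hinge(y_true, y_pred):
    return _mean(max(1.0 - t * p, 0.0) ** 2 for t, p in zip(y_true, y_pred))


def poisson(y_true, y_pred):
    return _mean(p - t * math.log(p) for t, p in zip(y_true, y_pred))


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    def clip(p):
        return min(max(p, eps), 1.0 - eps)
    return _mean(
        -(t * math.log(clip(p)) + (1.0 - t) * math.log(1.0 - clip(p)))
        for t, p in zip(y_true, y_pred)
    )


print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # 2.0
print(hinge([1.0, -1.0], [0.5, 0.5]))              # 1.0
```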

# Metrics

- binary_accuracy - for binary classification
- categorical_accuracy - for multiclass classification
- sparse_categorical_accuracy
- top_k_categorical_accuracy - when the target class is within the top-k predictions provided
- mean_squared_error (mse) - for regression
- mean_absolute_error (mae)
- mean_absolute_percentage_error (mape)
- mean_squared_logarithmic_error (msle)
- hinge - hinge loss: `max(1 - y_true * y_pred, 0)`
- squared_hinge - squared hinge loss: `max(1 - y_true * y_pred, 0) ^ 2`
- categorical_crossentropy - for multiclass classification
- sparse_categorical_crossentropy
- binary_crossentropy - for binary classification
- kullback_leibler_divergence
- poisson
- cosine_proximity
- matthews_correlation - for quality of binary classification
- fbeta_score - weighted harmonic mean of precision and recall in multi-label classification
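
As a sketch of how the accuracy metrics differ, the plain-Python functions below compute categorical and top-k accuracy from one-hot targets and per-class scores. The names mirror the metric names, but the code is my own illustration, not the Keras implementation.

```python
def _argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose highest score is on the true class."""
    hits = [_argmax(t) == _argmax(p) for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


def top_k_categorical_accuracy(y_true, y_pred, k=5):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = []
    for t, p in zip(y_true, y_pred):
        top_k = sorted(range(len(p)), key=p.__getitem__, reverse=True)[:k]
        hits.append(_argmax(t) in top_k)
    return sum(hits) / len(hits)


y_true = [[0, 0, 1], [0, 1, 0]]
y_pred = [[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]]
print(categorical_accuracy(y_true, y_pred))             # 0.5
print(top_k_categorical_accuracy(y_true, y_pred, k=2))  # 1.0
```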

# Optimizers

- SGD - Stochastic gradient descent, with support for momentum, learning rate decay, and Nesterov momentum
- RMSprop - good for RNNs
- Adagrad
- Adadelta
- Adamax
- Adam
- Nadam

# Activation Functions

- softmax
- softplus
- softsign
- relu
- tanh
- sigmoid
- hard_sigmoid
- linear

# Callbacks

name | description | params |
---|---|---|
Callback | abstract base class - hooks: `on_epoch_end`, `on_batch_begin`, `on_batch_end` | |
BaseLogger | accumulates epoch averages of the metrics being monitored | |
ProgbarLogger | writes to stdout | |
History | records events into a History object (automatic) | |
ModelCheckpoint | saves the model after every epoch, according to the monitored quantity | `filepath, monitor, verbose, save_best_only, save_weights_only, mode` |
EarlyStopping | stops training when the monitored quantity has not improved for `patience` epochs | `monitor, min_delta, patience, verbose, mode` |
RemoteMonitor | streams events to a server | `root, path, field` |
LearningRateScheduler | sets the learning rate at each epoch from `schedule`, a function that takes the epoch index and returns the learning rate | `schedule` |
TensorBoard | writes a log for TensorBoard to visualize | `log_dir, histogram_freq, write_graph, write_images` |
ReduceLROnPlateau | reduces the learning rate when a metric has stopped improving | `monitor, factor, patience, verbose, mode, epsilon, cooldown, min_lr` |
CSVLogger | streams epoch results to a csv file | `filename, separator, append` |
LambdaCallback | custom callback | `on_epoch_begin, on_epoch_end, on_batch_begin, on_batch_end, on_train_begin, on_train_end` |
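
The `patience` / `min_delta` logic of EarlyStopping can be sketched in a few lines of plain Python. This is a simplified stand-in for a monitored quantity that should decrease (eg a validation loss); the function name is hypothetical and the real callback may differ by one epoch in when exactly it stops.

```python
def early_stopping_epoch(losses, patience=0, min_delta=0.0):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without a sufficient improvement
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


val_losses = [1.0, 0.9, 0.95, 0.96, 0.97]
print(early_stopping_epoch(val_losses, patience=2))  # 3
```

With `patience=2` the best value is reached at epoch 1, epochs 2 and 3 bring no improvement, so the sketch stops at epoch 3.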

# Init Functions

- uniform
- lecun_uniform
- identity
- orthogonal
- zero
- glorot_normal - Gaussian initialization scaled by fan_in + fan_out
- glorot_uniform
- he_uniform
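
What "scaled by fan_in + fan_out" means in practice: each scheme picks a standard deviation or a uniform bound from the layer's fan-in / fan-out. The sketch below uses the commonly cited formulas (Glorot: `2 / (fan_in + fan_out)`, He: based on `fan_in` only); the helper names are my own and the code is not taken from Keras.

```python
import math
import random


def glorot_normal_stddev(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))


def glorot_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_uniform_limit(fan_in):
    return math.sqrt(6.0 / fan_in)


def lecun_uniform_limit(fan_in):
    return math.sqrt(3.0 / fan_in)


def glorot_uniform(fan_in, fan_out, rng):
    """Draw a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


weights = glorot_uniform(4, 2, random.Random(0))
print(len(weights), len(weights[0]))  # 4 2
```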

# Regularizers

## arguments

- W_regularizer, b_regularizer (WeightRegularizer)
- activity_regularizer (ActivityRegularizer)

## penalties

- l1 - LASSO
- l2 - weight decay, Ridge
- l1l2 - ElasticNet
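
Each penalty is just an extra term added to the loss. A minimal sketch over a flat list of weights, with hypothetical function names and coefficient defaults chosen only for illustration:

```python
def l1_penalty(weights, l1=0.01):
    """LASSO: l1 * sum(|w|)."""
    return l1 * sum(abs(w) for w in weights)


def l2_penalty(weights, l2=0.01):
    """Ridge / weight decay: l2 * sum(w^2)."""
    return l2 * sum(w * w for w in weights)


def l1l2_penalty(weights, l1=0.01, l2=0.01):
    """ElasticNet: the sum of both penalties."""
    return l1_penalty(weights, l1) + l2_penalty(weights, l2)


w = [1.0, -2.0]
print(l1_penalty(w), l2_penalty(w), l1l2_penalty(w))
```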

# Constraints

## arguments

- W_constraint - for the main weights matrix
- b_constraint - for the bias

## constraints

- maxnorm - maximum-norm
- nonneg - non-negativity
- unitnorm - unit-norm
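
A constraint is a projection applied to the weights after each update. The sketch below applies each one to a single weight vector in plain Python; real layers apply them along an axis of the weight matrix, and the `eps` term and function signatures here are my own assumptions.

```python
import math


def _norm(weights):
    return math.sqrt(sum(w * w for w in weights))


def maxnorm(weights, m=2.0, eps=1e-7):
    """Rescale so the norm is at most m; smaller vectors are left (almost) as-is."""
    norm = _norm(weights)
    desired = min(norm, m)
    return [w * desired / (eps + norm) for w in weights]


def nonneg(weights):
    """Zero out negative weights."""
    return [max(w, 0.0) for w in weights]


def unitnorm(weights, eps=1e-7):
    """Rescale to unit norm."""
    norm = _norm(weights)
    return [w / (eps + norm) for w in weights]


print(maxnorm([3.0, 4.0], m=2.0))  # roughly [1.2, 1.6]
print(nonneg([-1.0, 2.0]))         # [0.0, 2.0]
print(unitnorm([3.0, 4.0]))        # roughly [0.6, 0.8]
```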

# Tuning Hyper-Parameters

- batch size
- number of epochs
- training optimization algorithm
- learning rate
- momentum
- network weight initialization
- activation function
- dropout regularization
- number of neurons in a hidden layer
- depth of hidden layers
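
One simple way to explore the hyper-parameters listed above is an exhaustive grid. The sketch below only enumerates the combinations; the search space values are made up for illustration, and training / scoring a model for each candidate is left out.

```python
from itertools import product

# Hypothetical search space: names and values are examples only.
search_space = {
    "batch_size": [32, 64],
    "nb_epoch": [10, 20],
    "optimizer": ["sgd", "adam"],
    "dropout": [0.2, 0.5],
}


def grid(space):
    """Yield one dict per combination of hyper-parameter values."""
    keys = sorted(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


candidates = list(grid(search_space))
print(len(candidates))  # 16
print(candidates[0])
```

The grid grows multiplicatively with every added parameter, so in practice it is common to tune only a few of them at a time.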