Single-cell RNA-seq (scRNA-seq) has many applications, and its complex data raises a range of computational challenges. In this project, we will work directly with scRNA-seq data in a real-world application.
Background on the Dataset
Peripheral blood mononuclear cells (PBMCs) are a group of different cell types in our blood, a subset of what we call white blood cells. They include many important parts of our immune system: T cells, B cells, natural killer cells, and so on. These cells are responsible for our innate and adaptive immune responses, and they kick into high gear when we get infected or vaccinated. In different situations, or with different diseases, the mix of cell types shifts, and so does the gene expression within each cell type. PBMCs can be isolated easily from patient blood samples, simply by spinning the blood in a centrifuge, which separates the cell types by density. Looking at these cells can therefore be very informative.
Part I. Autoencoder
Starting from scRNA-seq data from PBMCs, we will look at the cells in a lower-dimension space. We will first implement an autoencoder to find a latent space representation of our data. Then, we will compare two-dimensional representations of our data using t-SNE, PCA, and the latent space defined by our autoencoder.
Autoencoder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras import layers
This data associates a cell barcode, such as “AAAGCCTGGCTAAC-1”, with a cell type label, such as “CD14+ Monocyte”. For each cell barcode, there are also log-normalized RNA-seq counts of 765 different genes, such as HES4.
labels.csv stores the association between a cell barcode and a cell type label.
processed_counts.csv stores the normalized log read counts for each cell, where each row represents a single cell, and each column represents a gene.
labels_pd = pd.read_csv("labels.csv", index_col=0)
counts_pd = pd.read_csv("processed_counts.csv", index_col=0)
labels_pd.head()
| index | bulk_labels |
|---|---|
| AAAGCCTGGCTAAC-1 | CD14+ Monocyte |
| AAATTCGATGCACA-1 | Dendritic |
| AACACGTGGTCTTT-1 | CD56+ NK |
| AAGTGCACGTGCTA-1 | CD4+/CD25 T Reg |
| ACACGAACGGAGTG-1 | Dendritic |
counts_pd.head()
| | HES4 | TNFRSF4 | SSU72 | PARK7 | RBP7 | … | SUMO3 | ITGB2 | S100B | PRMT2 | MT-ND3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAGCCTGGCTAAC-1 | -0.326 | -0.191 | -0.728 | -0.301 | 3.386 | … | 4.294 | 0.519 | -0.21 | -0.636 | 4.011 |
| AAATTCGATGCACA-1 | 1.171 | -0.191 | 0.795 | -1.200 | -0.174 | … | -0.585 | 1.172 | -0.21 | 2.630 | -0.490 |
| AACACGTGGTCTTT-1 | -0.326 | -0.191 | 0.483 | -1.200 | -0.174 | … | -0.585 | 0.722 | -0.21 | 0.663 | -0.490 |
| AAGTGCACGTGCTA-1 | -0.326 | -0.191 | 1.134 | -0.157 | -0.174 | … | -0.585 | 0.766 | -0.21 | -0.636 | -0.490 |
| ACACGAACGGAGTG-1 | -0.326 | -0.191 | -0.728 | -0.607 | -0.174 | … | -0.585 | -0.007 | -0.21 | -0.636 | -0.490 |
5 rows × 765 columns
Split into train and test sets (approximately 80:20).
# Shuffle the rows, then keep each row for training with probability 0.8
counts_shuffled = counts_pd.sample(frac=1)
split = np.random.rand(len(counts_shuffled)) < 0.8
counts_train = counts_shuffled[split]
counts_test = counts_shuffled[~split]
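Note that the boolean mask keeps each row for training with probability 0.8, so the split is only approximately 80:20. If an exact split is preferred, a minimal alternative is scikit-learn's train_test_split (the seed value here is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Exact 80:20 split with a fixed seed for reproducibility
counts_train, counts_test = train_test_split(counts_pd, test_size=0.2, random_state=0)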
Create a fully connected neural network for our autoencoder. The latent space can be of any size up to 64: too large may give a poor visualization, and too small may give high loss, so 32 is a good starting point. We will use more than one hidden layer and a sparsity constraint (L1 activity regularization). Finally, we build both the full autoencoder and an encoder model consisting of only the encoding layers.
encoding_dim = 32
input_data = keras.Input(shape=(765,))
# Encoder: 765 -> 64 -> 32 -> 32, with L1 activity regularization for sparsity
latent_1 = layers.Dense(encoding_dim * 2, activation='relu')(input_data)
latent_2 = layers.Dense(encoding_dim, activation='sigmoid',
                        activity_regularizer=keras.regularizers.l1(10e-5))(latent_1)
encoded = layers.Dense(encoding_dim, activation='relu',
                       activity_regularizer=keras.regularizers.l1(10e-5))(latent_2)
# Decoder: 32 -> 765, reconstructing the input from the latent code
decoded = layers.Dense(765, activation='relu')(encoded)
# Build the encoder (inputs -> latent code) and the full autoencoder (inputs -> reconstruction)
encoder = keras.Model(input_data, encoded)
autoencoder = keras.Model(input_data, decoded)
autoencoder.summary()
Model: "model_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, 765)] 0 _________________________________________________________________ dense (Dense) (None, 64) 49024 _________________________________________________________________ dense_1 (Dense) (None, 32) 2080 _________________________________________________________________ dense_2 (Dense) (None, 32) 1056 _________________________________________________________________ dense_3 (Dense) (None, 765) 25245 ================================================================= Total params: 77,405 Trainable params: 77,405 Non-trainable params: 0 _________________________________________________________________
Once the model is ready, let's train it using MSE loss. Then, identify hyperparameters that don't overfit, and retrain the same architecture on all of the data together. With a latent space of size 32, we aim for about 0.9 MSE loss on the test set (0.95 with regularization).
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(counts_train, counts_train,
epochs=50,
batch_size=256,
validation_data=(counts_test, counts_test))
Epoch 1/50 3/3 [==============================] - 5s 1s/step - loss: 0.9907 - val_loss: 1.0679
Epoch 2/50 3/3 [==============================] - 0s 117ms/step - loss: 0.9824 - val_loss: 1.0666
…
Epoch 49/50 3/3 [==============================] - 0s 179ms/step - loss: 0.9085 - val_loss: 0.9941
Epoch 50/50 3/3 [==============================] - 0s 186ms/step - loss: 0.9029 - val_loss: 0.9932
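To help "identify the parameters which don't overfit", one option (an assumption, not shown in the original run) is Keras's EarlyStopping callback, after which the same architecture can be fit on all of the data; the patience and epoch values here are arbitrary:

from keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 10 epochs, keeping the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
autoencoder.fit(counts_train, counts_train,
                epochs=200,
                batch_size=256,
                validation_data=(counts_test, counts_test),
                callbacks=[early_stop])

# Final model: continue training the same architecture on all of the data
autoencoder.fit(counts_pd, counts_pd, epochs=50, batch_size=256)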
Now we use PCA and t-SNE on the original dataset, then PCA on the latent space representation produced by our encoder, and plot all three.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Carry out PCA on counts_pd
pca = PCA(n_components=2)
X_pca = pca.fit_transform(counts_pd)
# Visualize X_pca
plt.figure(figsize=(10,10))
sns.scatterplot(
x=X_pca[:,0], y=X_pca[:,1],
hue=labels_pd['bulk_labels'],
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.75)
plt.title('PCA on Original Dataset')
plt.show()
# Carry out t-SNE on counts_pd
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(counts_pd)
# Visualize X_tsne
plt.figure(figsize=(10,10))
sns.scatterplot(
x=X_tsne[:,0], y=X_tsne[:,1],
hue=labels_pd['bulk_labels'],
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.75
)
plt.title('t-SNE on Original Dataset')
plt.show()
# Carry out PCA on latent space
counts_encoded = encoder.predict(counts_pd)
pca = PCA(n_components=2)
encoded_pca = pca.fit_transform(counts_encoded)
# Visualize encoded_pca
plt.figure(figsize=(10,10))
sns.scatterplot(
x=encoded_pca[:,0], y=encoded_pca[:,1],
hue=labels_pd['bulk_labels'],
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.75)
plt.title('PCA on Latent Space Representation')
plt.show()
Comparison of Dimensionality Reduction
We can see that all three plots closely resemble each other, and we can identify the major clustering patterns in each of them (e.g., the yellow-red overlapping cluster and the green-cyan-purple overlapping cluster). However, there are still some differences between the plots. First, PCA on the original dataset shows about four overlapping but reasonably well-separated clusters. Next, t-SNE on the original dataset separates certain labels better (e.g., the magenta points are better separated than in the first plot), but its major drawback is that it is by far the most time-consuming of the three methods. Lastly, PCA on the latent space representation created by our encoder preserves the same clustering patterns while also separating certain labels, such as the magenta points, more cleanly. Therefore, I think PCA on the autoencoder's latent space was the best-performing method: it is a good compromise between PCA and t-SNE on the original data. It is fast and requires less space thanks to the encoder's compression, while still capturing details that plain PCA on the original dataset misses.
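To attach a number to this visual comparison (an addition, not part of the original analysis), one option is the silhouette score of the cell-type labels in each 2-D embedding; higher values mean tighter, better-separated label clusters:

from sklearn.metrics import silhouette_score

cell_labels = labels_pd['bulk_labels'].values
for name, emb in [('PCA', X_pca), ('t-SNE', X_tsne), ('PCA on latent space', encoded_pca)]:
    # Silhouette is in [-1, 1]; higher means better separation of the labeled clusters
    print('%-20s silhouette: %.3f' % (name, silhouette_score(emb, cell_labels)))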
Part II. Classification
Each cell in the original data is labeled with its cell type. We will try a few different classifier methods and see what accuracy we can achieve with each of them. Specifically, we will use AdaBoostClassifier with an SVC base estimator and compare it to a random forest.
Classification
import pandas as pd
import numpy as np
import keras
from keras import layers
Reminder: This data associates a cell barcode, such as “AAAGCCTGGCTAAC-1”, with a cell type label, such as “CD14+ Monocyte”. For each cell barcode, there are also log-normalized RNA-seq counts of 765 different genes, such as HES4.
labels.csv stores the association between a cell barcode and a cell type label.
processed_counts.csv stores the normalized log read counts for each cell, where each row represents a single cell, and each column represents a gene.
labels_pd = pd.read_csv("labels.csv")
counts_pd = pd.read_csv("processed_counts.csv")
# Use the cell barcode column as the index of each DataFrame
labels_pd.index = labels_pd['index']
labels_pd.drop("index", axis=1, inplace=True)
counts_pd.index = counts_pd['Unnamed: 0']
counts_pd.drop("Unnamed: 0", axis=1, inplace=True)
df = counts_pd.merge(labels_pd, left_index=True, right_index=True).dropna()
df
| | HES4 | TNFRSF4 | SSU72 | PARK7 | RBP7 | … | ITGB2 | S100B | PRMT2 | MT-ND3 | bulk_labels |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAGCCTGGCTAAC-1 | -0.326 | -0.191 | -0.728 | -0.301 | 3.386 | … | 0.519 | -0.21 | -0.636 | 4.011 | CD14+ Monocyte |
| AAATTCGATGCACA-1 | 1.171 | -0.191 | 0.795 | -1.200 | -0.174 | … | 1.172 | -0.21 | 2.630 | -0.490 | Dendritic |
| AACACGTGGTCTTT-1 | -0.326 | -0.191 | 0.483 | -1.200 | -0.174 | … | 0.722 | -0.21 | 0.663 | -0.490 | CD56+ NK |
| AAGTGCACGTGCTA-1 | -0.326 | -0.191 | 1.134 | -0.157 | -0.174 | … | 0.766 | -0.21 | -0.636 | -0.490 | CD4+/CD25 T Reg |
| ACACGAACGGAGTG-1 | -0.326 | -0.191 | -0.728 | -0.607 | -0.174 | … | -0.007 | -0.21 | -0.636 | -0.490 | Dendritic |
| … | … | … | … | … | … | … | … | … | … | … | … |
| TGGCACCTCCAACA-8 | -0.326 | -0.191 | 0.372 | -0.584 | -0.174 | … | 0.561 | -0.21 | 0.543 | 2.593 | Dendritic |
| TGTGAGTGCTTTAC-8 | 3.166 | -0.191 | -0.728 | -1.200 | -0.174 | … | -0.171 | -0.21 | 1.268 | -0.490 | Dendritic |
| TGTTACTGGCGATT-8 | -0.326 | -0.191 | -0.728 | -1.200 | -0.174 | … | 0.152 | -0.21 | -0.636 | 1.226 | CD4+/CD25 T Reg |
| TTCAGTACCGGGAA-8 | -0.326 | -0.191 | -0.728 | -0.386 | -0.174 | … | -0.326 | -0.21 | -0.636 | -0.490 | CD19+ B |
| TTGAGGTGGAGAGC-8 | -0.326 | -0.191 | 0.148 | 0.762 | -0.174 | … | 0.239 | -0.21 | -0.636 | -0.490 | Dendritic |
700 rows × 766 columns
One-hot encode the cell-type.
Shuffle our data and make sure the labels and the counts are shuffled together.
Split into train and test sets (80:20 split)
categories = df['bulk_labels'].unique()
print(categories)
#one-hot encoding
y = np.zeros((len(df), len(categories)))
for i in range(len(df)):
cell_type = df.iloc[i]['bulk_labels']
pos = np.where(categories == cell_type)[0]
y[i, pos] = 1
#remove label when processing input data
X = df.drop('bulk_labels', axis=1).values
# shuffle and 80:20 split
np.random.seed(100)
permutation = np.random.permutation(len(X))
X, y = X[permutation], y[permutation]
X_train, y_train = X[:int(len(X)*0.8)], y[:int(len(y)*0.8)]
X_test, y_test = X[int(len(X)*0.8):], y[int(len(y)*0.8):]
['CD14+ Monocyte' 'Dendritic' 'CD56+ NK' 'CD4+/CD25 T Reg' 'CD19+ B' 'CD8+ Cytotoxic T' 'CD4+/CD45RO+ Memory' 'CD8+/CD45RA+ Naive Cytotoxic' 'CD4+/CD45RA+/CD25- Naive T' 'CD34+']
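As an aside, the one-hot loop above can also be written in a single pandas call; note that get_dummies orders its columns alphabetically rather than by first appearance, so the column order differs from `categories`:

# Equivalent one-hot encoding in one line (columns sorted alphabetically)
y_onehot = pd.get_dummies(df['bulk_labels']).values.astype(float)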
#Visualize the One-hot encoded Prediction Labels
import matplotlib.pyplot as plt
plt.figure(figsize=(9,3), dpi=300)
plt.imshow(y_train[:50])
Now we can apply classification algorithms to the training data, tune on validation data (if present), and evaluate on test data.
We can also apply classification downstream of autoencoder latent space representation.
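As a minimal sketch of the latter idea, assuming the trained `encoder` from Part I is still in scope (otherwise it would need to be rebuilt and retrained first), we can compress the inputs and fit a classifier on the 32-dimensional codes:

from sklearn.ensemble import RandomForestClassifier

# Map the 765-dimensional counts into the 32-dimensional latent space
X_train_latent = encoder.predict(X_train)
X_test_latent = encoder.predict(X_test)

# Fit a random forest on the compressed features
rf_latent = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train_latent, y_train)
print('Latent-space testing accuracy: %.3f' % rf_latent.score(X_test_latent, y_test))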
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
# Build and fit RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=200,
max_depth=20,
min_samples_leaf=2,
max_features=0.25,
random_state=0).fit(X_train, y_train)
# Evaluate on test data
print('Training accuracy: %.3f' % rf_model.score(X_train, y_train))
print('Testing accuracy: %.3f' % rf_model.score(X_test, y_test))
Training accuracy: 0.979 Testing accuracy: 0.736
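The random forest hyperparameters above were picked by hand. As an aside (not part of the original analysis), a small cross-validated grid search is one way to tune them; the grid values here are assumptions:

from sklearn.model_selection import GridSearchCV

# Small, assumed grid over tree depth and feature fraction; 3-fold CV on the training set
param_grid = {'max_depth': [10, 20, None], 'max_features': [0.25, 'sqrt']}
search = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                      param_grid, cv=3).fit(X_train, y_train)
print(search.best_params_, 'CV accuracy: %.3f' % search.best_score_)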
# Undo the one-hot encoding: the class index is the position of the 1 in each row
y_train_1d = np.argmax(y_train, axis=1)
y_test_1d = np.argmax(y_test, axis=1)
# Build and fit AdaBoostClassifier with three different base estimators
acc_train, acc_test = [], []
n_est = list(range(50, 450, 50))
for n in n_est:
    # LinearSVC has no predict_proba, so use the SAMME algorithm
    svc = LinearSVC()
    abc_model_1 = AdaBoostClassifier(base_estimator=svc,
                                     n_estimators=n,
                                     algorithm="SAMME",
                                     random_state=0).fit(X_train, y_train_1d)
    # Boost the random forest fit above (sklearn clones it before refitting)
    abc_model_2 = AdaBoostClassifier(base_estimator=rf_model,
                                     n_estimators=n,
                                     algorithm="SAMME",
                                     random_state=0).fit(X_train, y_train_1d)
    # Default base estimator: a depth-1 decision tree (a stump)
    abc_model_3 = AdaBoostClassifier(n_estimators=n,
                                     algorithm="SAMME",
                                     random_state=0).fit(X_train, y_train_1d)
    models = [abc_model_1, abc_model_2, abc_model_3]
    acc_train.append([m.score(X_train, y_train_1d) for m in models])
    acc_test.append([m.score(X_test, y_test_1d) for m in models])
# Accuracy Comparison
acc_train = np.array(acc_train)
acc_test = np.array(acc_test)
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.plot(n_est, acc_train[:, 0], label='svc')
plt.plot(n_est, acc_train[:, 1], label='RandomForest')
plt.plot(n_est, acc_train[:, 2], label='DecisionTree')
plt.title('Training Accuracy')
plt.legend()
plt.xlabel('n_estimators')
plt.subplot(122)
plt.plot(n_est, acc_test[:, 0], label='svc')
plt.plot(n_est, acc_test[:, 1], label='RandomForest')
plt.plot(n_est, acc_test[:, 2], label='DecisionTree')
plt.title('Testing Accuracy')
plt.xlabel('n_estimators')
plt.legend()
plt.show()
Classification
I first used a random forest as my classifier, but I was not able to get higher testing accuracy by adjusting its hyperparameters. I then wanted to add boosting on top of the random forest, so I started exploring AdaBoostClassifier, which by default uses decision trees with max_depth=1 as its base_estimator. However, as shown above, a decision tree stump is not a good choice at any number of estimators. So I used my previous random forest model as the base_estimator, and it turned out to work quite well: its training accuracy is 1 and its testing accuracy is consistently above 0.8. Lastly, to further improve my classifier, I explored support vector machines (SVMs), another popular classification algorithm, and decided to use an SVM with a linear kernel as the base_estimator. It performed even better than my random forest model, consistently reaching testing accuracy above 0.85. Therefore, my best model is AdaBoostClassifier with a linear SVM as its base_estimator, with testing accuracy above 0.85 and training accuracy of 1, while also being the most efficient of the three.
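As a possible follow-up (not run in the original analysis), a confusion matrix would show which cell types the best model confuses, rather than a single accuracy number; a minimal sketch, assuming the last fitted abc_model_1 (the linear-SVC AdaBoost) from the loop above:

from sklearn.metrics import confusion_matrix

# Rows: true cell types; columns: predicted cell types (in `categories` order)
y_pred = abc_model_1.predict(X_test)
cm = confusion_matrix(y_test_1d, y_pred, labels=np.arange(len(categories)))
print(pd.DataFrame(cm, index=categories, columns=categories))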