Single-cell RNA-seq (scRNA-seq) has many applications, and its complex data raises a range of computational challenges. In this project, we will work directly with scRNA-seq data in a real-world application.
Background on the Dataset
Peripheral blood mononuclear cells (PBMCs) are a group of different cell types in our blood, a subset of what we call white blood cells. They include many important parts of our immune system: T cells, B cells, natural killer cells, and so on. These cells are responsible for our innate and adaptive immune responses, and they kick into high gear when we get infected or vaccinated. In different situations, or with different diseases, the mix of cell types shifts, and so does the gene expression within each cell type. PBMCs can be isolated easily from patient blood samples, simply by spinning the blood in a centrifuge, which separates the cell types by density. Looking at these cells can therefore be very informative.
Part I. Autoencoder
Starting from scRNA-seq data from PBMCs, we will look at the cells in a lower-dimension space. We will first implement an autoencoder to find a latent space representation of our data. Then, we will compare two-dimensional representations of our data using t-SNE, PCA, and the latent space defined by our autoencoder.
Autoencoder
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import keras
from keras import layers
This data associates a cell barcode, such as “AAAGCCTGGCTAAC-1”, with a cell type label, such as “CD14+ Monocyte”. For each cell barcode, there are also log-normalized RNA-seq counts of 765 different genes, such as HES4.
labels.csv stores the association between a cell barcode and a cell type label.
processed_counts.csv stores the normalized log read counts for each cell, where each row represents a single cell, and each column represents a gene.
labels_pd = pd.read_csv("labels.csv", index_col=0)
counts_pd = pd.read_csv("processed_counts.csv", index_col=0)
labels_pd.head()
| index | bulk_labels |
|---|---|
| AAAGCCTGGCTAAC-1 | CD14+ Monocyte |
| AAATTCGATGCACA-1 | Dendritic |
| AACACGTGGTCTTT-1 | CD56+ NK |
| AAGTGCACGTGCTA-1 | CD4+/CD25 T Reg |
| ACACGAACGGAGTG-1 | Dendritic |
counts_pd.head()
| | HES4 | TNFRSF4 | SSU72 | PARK7 | RBP7 | … | SUMO3 | ITGB2 | S100B | PRMT2 | MT-ND3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAGCCTGGCTAAC-1 | -0.326 | -0.191 | -0.728 | -0.301 | 3.386 | … | 4.294 | 0.519 | -0.21 | -0.636 | 4.011 |
| AAATTCGATGCACA-1 | 1.171 | -0.191 | 0.795 | -1.200 | -0.174 | … | -0.585 | 1.172 | -0.21 | 2.630 | -0.490 |
| AACACGTGGTCTTT-1 | -0.326 | -0.191 | 0.483 | -1.200 | -0.174 | … | -0.585 | 0.722 | -0.21 | 0.663 | -0.490 |
| AAGTGCACGTGCTA-1 | -0.326 | -0.191 | 1.134 | -0.157 | -0.174 | … | -0.585 | 0.766 | -0.21 | -0.636 | -0.490 |
| ACACGAACGGAGTG-1 | -0.326 | -0.191 | -0.728 | -0.607 | -0.174 | … | -0.585 | -0.007 | -0.21 | -0.636 | -0.490 |
5 rows × 765 columns
Split into train and test sets (approximately 80:20).
# Shuffle the rows, then keep each row for training with probability 0.8
counts_shuffled = counts_pd.sample(frac=1)
split = np.random.rand(len(counts_shuffled)) < 0.8
counts_train = counts_shuffled[split]
counts_test = counts_shuffled[~split]
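Note that the boolean mask keeps each row for training with probability 0.8, so the split is only approximately 80:20. If an exact split is preferred, a minimal alternative is scikit-learn's train_test_split (the seed value here is an arbitrary choice):

from sklearn.model_selection import train_test_split

# Exact 80:20 split with a fixed seed for reproducibility
counts_train, counts_test = train_test_split(counts_pd, test_size=0.2, random_state=0)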
Create a fully connected neural network for our autoencoder. The latent space can be of any size up to 64: too large may give a poor visualization, and too small may give high loss, so 32 is a good starting point. We will use more than one hidden layer and a sparsity constraint (L1 activity regularization). Finally, we build both the full autoencoder and an encoder model consisting of only the encoding layers.
encoding_dim = 32
input_data = keras.Input(shape=(765,))
# Encoder: 765 -> 64 -> 32 -> 32, with L1 activity regularization for sparsity
latent_1 = layers.Dense(encoding_dim * 2, activation='relu')(input_data)
latent_2 = layers.Dense(encoding_dim, activation='sigmoid',
                        activity_regularizer=keras.regularizers.l1(10e-5))(latent_1)
encoded = layers.Dense(encoding_dim, activation='relu',
                       activity_regularizer=keras.regularizers.l1(10e-5))(latent_2)
# Decoder: 32 -> 765, reconstructing the input from the latent code
decoded = layers.Dense(765, activation='relu')(encoded)
# Build the encoder (inputs -> latent code) and the full autoencoder (inputs -> reconstruction)
encoder = keras.Model(input_data, encoded)
autoencoder = keras.Model(input_data, decoded)
autoencoder.summary()
Model: "model_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, 765)] 0 _________________________________________________________________ dense (Dense) (None, 64) 49024 _________________________________________________________________ dense_1 (Dense) (None, 32) 2080 _________________________________________________________________ dense_2 (Dense) (None, 32) 1056 _________________________________________________________________ dense_3 (Dense) (None, 765) 25245 ================================================================= Total params: 77,405 Trainable params: 77,405 Non-trainable params: 0 _________________________________________________________________
Once the model is ready, let's train it using MSE loss. Then, identify hyperparameters that don't overfit, and retrain the same architecture on all of the data together. With a latent space of size 32, we aim for about 0.9 MSE loss on the test set (0.95 with regularization).
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(counts_train, counts_train,
epochs=50,
batch_size=256,
validation_data=(counts_test, counts_test))
Epoch 1/50 3/3 [==============================] - 5s 1s/step - loss: 0.9907 - val_loss: 1.0679
Epoch 2/50 3/3 [==============================] - 0s 117ms/step - loss: 0.9824 - val_loss: 1.0666
…
Epoch 49/50 3/3 [==============================] - 0s 179ms/step - loss: 0.9085 - val_loss: 0.9941
Epoch 50/50 3/3 [==============================] - 0s 186ms/step - loss: 0.9029 - val_loss: 0.9932
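To help "identify the parameters which don't overfit", one option (an assumption, not shown in the original run) is Keras's EarlyStopping callback, after which the same architecture can be fit on all of the data; the patience and epoch values here are arbitrary:

from keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 10 epochs, keeping the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
autoencoder.fit(counts_train, counts_train,
                epochs=200,
                batch_size=256,
                validation_data=(counts_test, counts_test),
                callbacks=[early_stop])

# Final model: continue training the same architecture on all of the data
autoencoder.fit(counts_pd, counts_pd, epochs=50, batch_size=256)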
Now we use PCA and t-SNE on the original dataset, then PCA on the latent space representation produced by our encoder, and plot all three.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# Carry out PCA on counts_pd
pca = PCA(n_components=2)
X_pca = pca.fit_transform(counts_pd)
# Visualize X_pca
plt.figure(figsize=(10,10))
sns.scatterplot(
x=X_pca[:,0], y=X_pca[:,1],
hue=labels_pd['bulk_labels'],
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.75)
plt.title('PCA on Original Dataset')
plt.show()
# Carry out t-SNE on counts_pd
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(counts_pd)
# Visualize X_tsne
plt.figure(figsize=(10,10))
sns.scatterplot(
x=X_tsne[:,0], y=X_tsne[:,1],
hue=labels_pd['bulk_labels'],
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.75
)
plt.title('t-SNE on Original Dataset')
plt.show()
# Carry out PCA on latent space
counts_encoded = encoder.predict(counts_pd)
pca = PCA(n_components=2)
encoded_pca = pca.fit_transform(counts_encoded)
# Visualize encoded_pca
plt.figure(figsize=(10,10))
sns.scatterplot(
x=encoded_pca[:,0], y=encoded_pca[:,1],
hue=labels_pd['bulk_labels'],
palette=sns.color_palette("hls", 10),
legend="full",
alpha=0.75)
plt.title('PCA on Latent Space Representation')
plt.show()
Comparison of Dimensionality Reduction
We can see that all three plots closely resemble each other, and we can identify the major clustering patterns in each of them (e.g., the yellow-red overlapping cluster and the green-cyan-purple overlapping cluster). However, there are still some differences between the plots. First, PCA on the original dataset shows about four overlapping but reasonably well-separated clusters. Next, t-SNE on the original dataset separates certain labels better (e.g., the magenta points are better separated than in the first plot), but its major drawback is that it is by far the most time-consuming of the three methods. Lastly, PCA on the latent space representation created by our encoder preserves the same clustering patterns while also separating certain labels, such as the magenta points, more cleanly. Therefore, I think PCA on the autoencoder's latent space was the best-performing method: it is a good compromise between PCA and t-SNE on the original data. It is fast and requires less space thanks to the encoder's compression, while still capturing details that plain PCA on the original dataset misses.
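To attach a number to this visual comparison (an addition, not part of the original analysis), one option is the silhouette score of the cell-type labels in each 2-D embedding; higher values mean tighter, better-separated label clusters:

from sklearn.metrics import silhouette_score

cell_labels = labels_pd['bulk_labels'].values
for name, emb in [('PCA', X_pca), ('t-SNE', X_tsne), ('PCA on latent space', encoded_pca)]:
    # Silhouette is in [-1, 1]; higher means better separation of the labeled clusters
    print('%-20s silhouette: %.3f' % (name, silhouette_score(emb, cell_labels)))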
Part II. Classification
Each cell in the original data is labeled with its cell type. We will try a few different classifier methods and see what accuracy we can achieve with each of them. Specifically, we will use AdaBoostClassifier with an SVC base estimator and compare it to a random forest.
Classification
import pandas as pd
import numpy as np
import keras
from keras import layers
Reminder: This data associates a cell barcode, such as “AAAGCCTGGCTAAC-1”, with a cell type label, such as “CD14+ Monocyte”. For each cell barcode, there are also log-normalized RNA-seq counts of 765 different genes, such as HES4.
labels.csv stores the association between a cell barcode and a cell type label.
processed_counts.csv stores the normalized log read counts for each cell, where each row represents a single cell, and each column represents a gene.
labels_pd = pd.read_csv("labels.csv")
counts_pd = pd.read_csv("processed_counts.csv")
# Use the cell barcode column as the index of each DataFrame
labels_pd.index = labels_pd['index']
labels_pd.drop("index", axis=1, inplace=True)
counts_pd.index = counts_pd['Unnamed: 0']
counts_pd.drop("Unnamed: 0", axis=1, inplace=True)
df = counts_pd.merge(labels_pd, left_index=True, right_index=True).dropna()
df
| | HES4 | TNFRSF4 | SSU72 | PARK7 | RBP7 | … | ITGB2 | S100B | PRMT2 | MT-ND3 | bulk_labels |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAGCCTGGCTAAC-1 | -0.326 | -0.191 | -0.728 | -0.301 | 3.386 | … | 0.519 | -0.21 | -0.636 | 4.011 | CD14+ Monocyte |
| AAATTCGATGCACA-1 | 1.171 | -0.191 | 0.795 | -1.200 | -0.174 | … | 1.172 | -0.21 | 2.630 | -0.490 | Dendritic |
| AACACGTGGTCTTT-1 | -0.326 | -0.191 | 0.483 | -1.200 | -0.174 | … | 0.722 | -0.21 | 0.663 | -0.490 | CD56+ NK |
| AAGTGCACGTGCTA-1 | -0.326 | -0.191 | 1.134 | -0.157 | -0.174 | … | 0.766 | -0.21 | -0.636 | -0.490 | CD4+/CD25 T Reg |
| ACACGAACGGAGTG-1 | -0.326 | -0.191 | -0.728 | -0.607 | -0.174 | … | -0.007 | -0.21 | -0.636 | -0.490 | Dendritic |
| … | … | … | … | … | … | … | … | … | … | … | … |
| TGGCACCTCCAACA-8 | -0.326 | -0.191 | 0.372 | -0.584 | -0.174 | … | 0.561 | -0.21 | 0.543 | 2.593 | Dendritic |
| TGTGAGTGCTTTAC-8 | 3.166 | -0.191 | -0.728 | -1.200 | -0.174 | … | -0.171 | -0.21 | 1.268 | -0.490 | Dendritic |
| TGTTACTGGCGATT-8 | -0.326 | -0.191 | -0.728 | -1.200 | -0.174 | … | 0.152 | -0.21 | -0.636 | 1.226 | CD4+/CD25 T Reg |
| TTCAGTACCGGGAA-8 | -0.326 | -0.191 | -0.728 | -0.386 | -0.174 | … | -0.326 | -0.21 | -0.636 | -0.490 | CD19+ B |
| TTGAGGTGGAGAGC-8 | -0.326 | -0.191 | 0.148 | 0.762 | -0.174 | … | 0.239 | -0.21 | -0.636 | -0.490 | Dendritic |
700 rows × 766 columns
One-hot encode the cell-type.
Shuffle our data and make sure the labels and the counts are shuffled together.
Split into train and test sets (80:20 split)
categories = df['bulk_labels'].unique()
print(categories)
#one-hot encoding
y = np.zeros((len(df), len(categories)))
for i in range(len(df)):
cell_type = df.iloc[i]['bulk_labels']
pos = np.where(categories == cell_type)[0]
y[i, pos] = 1
#remove label when processing input data
X = df.drop('bulk_labels', axis=1).values
# shuffle and 80:20 split
np.random.seed(100)
permutation = np.random.permutation(len(X))
X, y = X[permutation], y[permutation]
X_train, y_train = X[:int(len(X)*0.8)], y[:int(len(y)*0.8)]
X_test, y_test = X[int(len(X)*0.8):], y[int(len(y)*0.8):]
['CD14+ Monocyte' 'Dendritic' 'CD56+ NK' 'CD4+/CD25 T Reg' 'CD19+ B' 'CD8+ Cytotoxic T' 'CD4+/CD45RO+ Memory' 'CD8+/CD45RA+ Naive Cytotoxic' 'CD4+/CD45RA+/CD25- Naive T' 'CD34+']
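As an aside, the one-hot loop above can also be written in a single pandas call; note that get_dummies orders its columns alphabetically rather than by first appearance, so the column order differs from `categories`:

# Equivalent one-hot encoding in one line (columns sorted alphabetically)
y_onehot = pd.get_dummies(df['bulk_labels']).values.astype(float)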
#Visualize the One-hot encoded Prediction Labels
import matplotlib.pyplot as plt
plt.figure(figsize=(9,3), dpi=300)
plt.imshow(y_train[:50])
Now we can apply classification algorithms to the training data, tune on validation data (if present), and evaluate on test data.
We can also apply classification downstream of autoencoder latent space representation.
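As a minimal sketch of the latter idea, assuming the trained `encoder` from Part I is still in scope (otherwise it would need to be rebuilt and retrained first), we can compress the inputs and fit a classifier on the 32-dimensional codes:

from sklearn.ensemble import RandomForestClassifier

# Map the 765-dimensional counts into the 32-dimensional latent space
X_train_latent = encoder.predict(X_train)
X_test_latent = encoder.predict(X_test)

# Fit a random forest on the compressed features
rf_latent = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train_latent, y_train)
print('Latent-space testing accuracy: %.3f' % rf_latent.score(X_test_latent, y_test))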
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import LinearSVC
# Build and fit RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=200,
max_depth=20,
min_samples_leaf=2,
max_features=0.25,
random_state=0).fit(X_train, y_train)
# Evaluate on test data
print('Training accuracy: %.3f' % rf_model.score(X_train, y_train))
print('Testing accuracy: %.3f' % rf_model.score(X_test, y_test))
Training accuracy: 0.979 Testing accuracy: 0.736
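The random forest hyperparameters above were picked by hand. As an aside (not part of the original analysis), a small cross-validated grid search is one way to tune them; the grid values here are assumptions:

from sklearn.model_selection import GridSearchCV

# Small, assumed grid over tree depth and feature fraction; 3-fold CV on the training set
param_grid = {'max_depth': [10, 20, None], 'max_features': [0.25, 'sqrt']}
search = GridSearchCV(RandomForestClassifier(n_estimators=200, random_state=0),
                      param_grid, cv=3).fit(X_train, y_train)
print(search.best_params_, 'CV accuracy: %.3f' % search.best_score_)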
# Undo the one-hot encoding: the class index is the position of the 1 in each row
y_train_1d = np.argmax(y_train, axis=1)
y_test_1d = np.argmax(y_test, axis=1)
# Build and fit AdaBoostClassifier with three different base estimators
acc_train, acc_test = [], []
n_est = list(range(50, 450, 50))
for n in n_est:
    # LinearSVC has no predict_proba, so use the SAMME algorithm
    svc = LinearSVC()
    abc_model_1 = AdaBoostClassifier(base_estimator=svc,
                                     n_estimators=n,
                                     algorithm="SAMME",
                                     random_state=0).fit(X_train, y_train_1d)
    # Boost the random forest fit above (sklearn clones it before refitting)
    abc_model_2 = AdaBoostClassifier(base_estimator=rf_model,
                                     n_estimators=n,
                                     algorithm="SAMME",
                                     random_state=0).fit(X_train, y_train_1d)
    # Default base estimator: a depth-1 decision tree (a stump)
    abc_model_3 = AdaBoostClassifier(n_estimators=n,
                                     algorithm="SAMME",
                                     random_state=0).fit(X_train, y_train_1d)
    models = [abc_model_1, abc_model_2, abc_model_3]
    acc_train.append([m.score(X_train, y_train_1d) for m in models])
    acc_test.append([m.score(X_test, y_test_1d) for m in models])
# Accuracy Comparison
acc_train = np.array(acc_train)
acc_test = np.array(acc_test)
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.plot(n_est, acc_train[:, 0], label='svc')
plt.plot(n_est, acc_train[:, 1], label='RandomForest')
plt.plot(n_est, acc_train[:, 2], label='DecisionTree')
plt.title('Training Accuracy')
plt.legend()
plt.xlabel('n_estimators')
plt.subplot(122)
plt.plot(n_est, acc_test[:, 0], label='svc')
plt.plot(n_est, acc_test[:, 1], label='RandomForest')
plt.plot(n_est, acc_test[:, 2], label='DecisionTree')
plt.title('Testing Accuracy')
plt.xlabel('n_estimators')
plt.legend()
plt.show()
Classification
I first used a random forest as my classifier, but I was not able to get higher testing accuracy by adjusting its hyperparameters. I then wanted to add boosting on top of the random forest, so I started exploring AdaBoostClassifier, which by default uses decision trees with max_depth=1 as its base_estimator. However, as shown above, a decision tree stump is not a good choice at any number of estimators. So I used my previous random forest model as the base_estimator, and it turned out to work quite well: its training accuracy is 1 and its testing accuracy is consistently above 0.8. Lastly, to further improve my classifier, I explored support vector machines (SVMs), another popular classification algorithm, and decided to use an SVM with a linear kernel as the base_estimator. It performed even better than my random forest model, consistently reaching testing accuracy above 0.85. Therefore, my best model is AdaBoostClassifier with a linear SVM as its base_estimator, with testing accuracy above 0.85 and training accuracy of 1, while also being the most efficient of the three.
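As a possible follow-up (not run in the original analysis), a confusion matrix would show which cell types the best model confuses, rather than a single accuracy number; a minimal sketch, assuming the last fitted abc_model_1 (the linear-SVC AdaBoost) from the loop above:

from sklearn.metrics import confusion_matrix

# Rows: true cell types; columns: predicted cell types (in `categories` order)
y_pred = abc_model_1.predict(X_test)
cm = confusion_matrix(y_test_1d, y_pred, labels=np.arange(len(categories)))
print(pd.DataFrame(cm, index=categories, columns=categories))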