{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": true
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\" style=\"margin-top: 1em;\"><ul class=\"toc-item\"><li><span><a href=\"#Calculation-of-multiplet-frequency-from-cell-type-mixing-in-Python\" data-toc-modified-id=\"Calculation-of-multiplet-frequency-from-cell-type-mixing-in-Python-1\">Calculation of multiplet frequency from cell-type mixing in <code>Python</code></a></span><ul class=\"toc-item\"><li><span><a href=\"#Function-to-compute-multiplet-frequency\" data-toc-modified-id=\"Function-to-compute-multiplet-frequency-1.1\">Function to compute multiplet frequency</a></span></li><li><span><a href=\"#Python-function\" data-toc-modified-id=\"Python-function-1.2\">Python function</a></span></li><li><span><a href=\"#Example-when-cell-types-mixed-in-equal-proportion\" data-toc-modified-id=\"Example-when-cell-types-mixed-in-equal-proportion-1.3\">Example when cell types mixed in equal proportion</a></span></li><li><span><a href=\"#Example-when-cell-types-are-mixed-unequally\" data-toc-modified-id=\"Example-when-cell-types-are-mixed-unequally-1.4\">Example when cell types are mixed unequally</a></span></li><li><span><a href=\"#Write-results-to-LaTex-tables-for-paper\" data-toc-modified-id=\"Write-results-to-LaTex-tables-for-paper-1.5\">Write results to LaTex tables for paper</a></span></li></ul></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculation of multiplet frequency from cell-type mixing in `Python`\n",
    "Here we implement the simple function to calculate the multiplet frequency from single-cell RNA sequencing experiments where we have mixed cells of two types (e.g., human and mouse), and know the number of observed droplets that contain cells of each type."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Function to compute multiplet frequency\n",
    "The multiplet frequency is computed in terms of the following three experimental observables:\n",
    "  - the number of droplets that contain at least one cell of type 1, which we denote as $N_1$\n",
    "  - the number of droplets that contain at least one of type 2, which we denote as $N_2$\n",
    "  - the number of droplets containing cells of both type 1 and type 2, which we denote as $N_{1,2}$\n",
    "\n",
    "The multiplet frequency $M$ is\n",
    "$$M = 1 - \\frac{\\left(\\mu_1 + \\mu_2\\right)e^{-\\mu_1 - \\mu_2}}{1 - e^{-\\mu_1 - \\mu_2}}$$\n",
    "where\n",
    "$$\\mu_1 = -\\ln\\left(\\frac{N - N_1}{N}\\right)$$\n",
    "and\n",
    "$$\\mu_2 = -\\ln\\left(\\frac{N - N_2}{N}\\right),$$\n",
    "and where\n",
    "$$N = \\frac{N_1 N_2}{N_{12}}.$$\n",
    "\n",
    "## Python function\n",
    "Here is the calculation implemented as Python function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy\n",
    "\n",
    "def multipletFreq(n1, n2, n12):\n",
    "    \"\"\"Estimated multiplet frequency from cell-type mixing experiment.\n",
    "\n",
    "    `n1`, `n2`, `n12` (`int` or `numpy.ndarray` of integers)\n",
    "        Number of droplets with at least one cell of type 1,\n",
    "        at least one cell of type 2, or cells of both types.\n",
    "    \"\"\"\n",
    "    n = numpy.array(n1 * n2 / n12).astype('float')\n",
    "    mu1 = -numpy.log((n - n1) / n)\n",
    "    mu2 = -numpy.log((n - n2) / n)\n",
    "    mu = mu1 + mu2\n",
    "    return 1 - mu * numpy.exp(-mu) / (1 - numpy.exp(-mu))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example when cell types mixed in equal proportion\n",
    "First we demonstrate the calculations in the case when the cell types are mixed in equal proportions.\n",
    "\n",
    "Let's create some hypothetical data.\n",
    "We'll imagine that these data come from the [10X cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) software analysis of a multi-species experiment that mixed mouse and human cells equally.\n",
    "The current version (2.1.1) of the [10X cellranger](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger) pipeline for this type of study would return the *human Estimated Number of Cell Partitions*, the *mouse Estimated Number of Cell Partitions*, and the number of *GEMs with > 0 Cells* (this last quantity is also reported as the *Estimated Number of Cells*).\n",
    "These statistics give the numbers of non-empty GEMs with cells of each type (GEMs is the term that 10X uses to refer to their droplets).\n",
    "\n",
    "Here is a data frame of some hypothetical data from three different experiments, each with 4000 cells total but with different numbers of cross-celltype droplets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jbloom/Documents/software/conda/envs/BloomLab/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n",
      "  return f(*args, **kwds)\n",
      "/Users/jbloom/Documents/software/conda/envs/BloomLab/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88\n",
      "  return f(*args, **kwds)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>human_droplets</th>\n",
       "      <th>mouse_droplets</th>\n",
       "      <th>nonempty_droplets</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>experiment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "      <td>4000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2050</td>\n",
       "      <td>2050</td>\n",
       "      <td>4000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2500</td>\n",
       "      <td>2500</td>\n",
       "      <td>4000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            human_droplets  mouse_droplets  nonempty_droplets\n",
       "experiment                                                   \n",
       "1                     2005            2005               4000\n",
       "2                     2050            2050               4000\n",
       "3                     2500            2500               4000"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas\n",
    "\n",
    "df_equal = (pandas.DataFrame({\n",
    "                'human_droplets':[2005, 2050, 2500],\n",
    "                'mouse_droplets':[2005, 2050, 2500],\n",
    "                'nonempty_droplets':[4000, 4000, 4000]\n",
    "                },\n",
    "                index=numpy.arange(3) + 1)\n",
    "            .rename_axis('experiment')\n",
    "            )\n",
    "\n",
    "df_equal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We calculate the number of droplets with cells of **both** types (human and mouse) simply as the sum of the human and mouse droplets minus the total number of non-empty droplets, since these cross-celltype droplets are double counted in the tally of human and mouse droplets.\n",
    "As is apparent after this calculation, in this hypothetical example the cross-celltype droplets represent 0.25%, 2.5%, and 25% of the total non-empty droplets in the three examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>human_droplets</th>\n",
       "      <th>mouse_droplets</th>\n",
       "      <th>nonempty_droplets</th>\n",
       "      <th>human_and_mouse_droplets</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>experiment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "      <td>4000</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2050</td>\n",
       "      <td>2050</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2500</td>\n",
       "      <td>2500</td>\n",
       "      <td>4000</td>\n",
       "      <td>1000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            human_droplets  mouse_droplets  nonempty_droplets  \\\n",
       "experiment                                                      \n",
       "1                     2005            2005               4000   \n",
       "2                     2050            2050               4000   \n",
       "3                     2500            2500               4000   \n",
       "\n",
       "            human_and_mouse_droplets  \n",
       "experiment                            \n",
       "1                                 10  \n",
       "2                                100  \n",
       "3                               1000  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_equal = df_equal.assign(human_and_mouse_droplets=lambda x:\n",
    "        x.human_droplets + x.mouse_droplets - x.nonempty_droplets)\n",
    "\n",
    "df_equal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now calculate the multiplet frequency in two ways.\n",
    "The first way is to precisely calculate the multiplet frequency using the exact Poisson derivation as implemented in the `multipletFreq` function above.\n",
    "\n",
    "The second way is to do the simple calculation that has commonly been used in paper that do equal-proportion mixing.\n",
    "This method is to simply estimate the multiplet frequency as twice the frequency of cross-cell-type droplets among all non-empty droplets (that is, as $\\frac{2 \\times N_{1,2}}{N_1 + N_2 + N_{12}}$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>human_droplets</th>\n",
       "      <th>mouse_droplets</th>\n",
       "      <th>nonempty_droplets</th>\n",
       "      <th>human_and_mouse_droplets</th>\n",
       "      <th>multiplet_freq</th>\n",
       "      <th>twice_cross_celltype_freq</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>experiment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2005</td>\n",
       "      <td>2005</td>\n",
       "      <td>4000</td>\n",
       "      <td>10</td>\n",
       "      <td>0.005</td>\n",
       "      <td>0.005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2050</td>\n",
       "      <td>2050</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "      <td>0.049</td>\n",
       "      <td>0.050</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2500</td>\n",
       "      <td>2500</td>\n",
       "      <td>4000</td>\n",
       "      <td>1000</td>\n",
       "      <td>0.425</td>\n",
       "      <td>0.500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            human_droplets  mouse_droplets  nonempty_droplets  \\\n",
       "experiment                                                      \n",
       "1                     2005            2005               4000   \n",
       "2                     2050            2050               4000   \n",
       "3                     2500            2500               4000   \n",
       "\n",
       "            human_and_mouse_droplets  multiplet_freq  \\\n",
       "experiment                                             \n",
       "1                                 10           0.005   \n",
       "2                                100           0.049   \n",
       "3                               1000           0.425   \n",
       "\n",
       "            twice_cross_celltype_freq  \n",
       "experiment                             \n",
       "1                               0.005  \n",
       "2                               0.050  \n",
       "3                               0.500  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_equal = (df_equal\n",
    "            .assign(multiplet_freq=lambda x: \n",
    "                multipletFreq(x.human_droplets,\n",
    "                              x.mouse_droplets,\n",
    "                              x.human_and_mouse_droplets))\n",
    "            .assign(twice_cross_celltype_freq=lambda x:\n",
    "                2 * x.human_and_mouse_droplets / x.nonempty_droplets)\n",
    "            )\n",
    "\n",
    "df_equal.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen above, the two methods give virtually identical results as long as the number of cross-celltype droplets is small relative to the total number of droplets.\n",
    "The reason that the two methods aren't identical as simply calculating the multiplet frequency as twice the cross-celltype frequency neglects to account for droplets that have more than two cells.\n",
    "However, this difference only becomes appreciable when the multiplet frequency is high.\n",
    "So for the examples above, we can see that it only really matters in the case when the true multiplet frequency is $\\approx$0.425; in that case, simply taking twice the cross-celltype frequency slightly overestimates the true multiplet frequency"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example when cell types are mixed unequally\n",
    "Now we repeat the example above, but for experiments where the cell types are mixed unequally.\n",
    "\n",
    "Below we give some example calculations for various numbers. \n",
    "An interesting (and initially non-intuitve aspect) of the results is that when the cells are mixed highly unequally and multiplets are common, the multiplet frequency is actually substantially **less** than the fraction of droplets with the rare cell type that are multiplets. \n",
    "The reason is that multiplets are more likely than singlets to have a cell of the rarer type, and become progressively more likely to have a cell of the rare type as the number of cells in the multiplet increases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>human_droplets</th>\n",
       "      <th>mouse_droplets</th>\n",
       "      <th>nonempty_droplets</th>\n",
       "      <th>human_and_mouse_droplets</th>\n",
       "      <th>multiplet_freq</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>experiment</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2050</td>\n",
       "      <td>2050</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "      <td>0.049</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3050</td>\n",
       "      <td>1050</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "      <td>0.065</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3550</td>\n",
       "      <td>550</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "      <td>0.110</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3850</td>\n",
       "      <td>250</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "      <td>0.245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>3950</td>\n",
       "      <td>150</td>\n",
       "      <td>4000</td>\n",
       "      <td>100</td>\n",
       "      <td>0.459</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            human_droplets  mouse_droplets  nonempty_droplets  \\\n",
       "experiment                                                      \n",
       "1                     2050            2050               4000   \n",
       "2                     3050            1050               4000   \n",
       "3                     3550             550               4000   \n",
       "4                     3850             250               4000   \n",
       "5                     3950             150               4000   \n",
       "\n",
       "            human_and_mouse_droplets  multiplet_freq  \n",
       "experiment                                            \n",
       "1                                100           0.049  \n",
       "2                                100           0.065  \n",
       "3                                100           0.110  \n",
       "4                                100           0.245  \n",
       "5                                100           0.459  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_unequal = (pandas.DataFrame({\n",
    "                  'human_droplets':[2050, 3050, 3550, 3850, 3950],\n",
    "                  'mouse_droplets':[2050, 1050, 550, 250, 150],\n",
    "                  'nonempty_droplets':[4000, 4000, 4000, 4000, 4000]\n",
    "                  },\n",
    "                  index=numpy.arange(5) + 1)\n",
    "              .rename_axis('experiment')\n",
    "              .assign(human_and_mouse_droplets=lambda x:\n",
    "                  x.human_droplets + x.mouse_droplets - x.nonempty_droplets)\n",
    "              .assign(multiplet_freq=lambda x:\n",
    "                  multipletFreq(x.human_droplets,\n",
    "                                x.mouse_droplets,\n",
    "                                x.human_and_mouse_droplets))\n",
    "              )\n",
    "\n",
    "df_unequal.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Write results to LaTex tables for paper\n",
    "Finally, we write the example results to LaTex tables to be included in the paper:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Writing equal example data frame to equal_example.tex\n",
      "Writing unequal example data frame to unequal_example.tex\n"
     ]
    }
   ],
   "source": [
    "for (df, dfname) in [(df_equal, 'equal'), (df_unequal, 'unequal')]:\n",
    "    f = '{0}_example.tex'.format(dfname)\n",
    "    print(\"Writing {0} example data frame to {1}\".format(dfname, f))\n",
    "    ncol = len(df.columns) + 1\n",
    "    column_format = 'C{0.66in}' * ncol\n",
    "    (df.round(3)\n",
    "       .rename(columns={col:col.replace('_', ' ') for col in df.columns})\n",
    "       .reset_index()\n",
    "       .to_latex(f, index=False, column_format=column_format)\n",
    "       )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": "none",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}