SWHarden.com

The personal website of Scott W Harden

Using DataFrames in C#

How to use the DataFrame class from the Microsoft.Data.Analysis package to interact with tabular data

The DataFrame is a data structure designed for manipulation, analysis, and visualization of tabular data, and it is the cornerstone of many data science applications. One of the most famous implementations of the data frame is provided by the Pandas package for Python. An equivalent data structure is available for C# using Microsoft’s data analysis package. Although data frames are commonly used in Jupyter notebooks, they can be used in standard .NET applications as well. This article surveys Microsoft’s Data Analysis package and introduces how to interact with with data frames using C# and the .NET platform.

DataFrame Quickstart

Add the Microsoft.Data.Analysis package to your project, then you can create a DataFrame like this:

using Microsoft.Data.Analysis;

string[] names = { "Oliver", "Charlotte", "Henry", "Amelia", "Owen" };
int[] ages = { 23, 19, 42, 64, 35 };
double[] heights = { 1.91, 1.62, 1.72, 1.57, 1.85 };

DataFrameColumn[] columns = {
    new StringDataFrameColumn("Name", names),
    new PrimitiveDataFrameColumn<int>("Age", ages),
    new PrimitiveDataFrameColumn<double>("Height", heights),
};

DataFrame df = new(columns);

Contents of a DataFrame can be previewed using Console.WriteLine(df) but the formatting isn’t pretty.

Name  Age   Height
Oliver23    1.91
Charlotte19    1.62
Henry 42    1.72
Amelia64    1.57
Owen  35    1.85

Pretty DataFrame Formatting

A custom PrettyPrint() extension method can improve DataFrame readability. Implementing this as an extension method allows me to call df.PrettyPrint() anywhere in my code.

💡 View the full PrettyPrinters.cs source code
using Microsoft.Data.Analysis;
using System.Text;

internal static class PrettyPrinters
{
    public static void PrettyPrint(this DataFrame df) => Console.WriteLine(PrettyText(df));
    public static string PrettyText(this DataFrame df) => ToStringArray2D(df).ToFormattedText();

    public static string ToMarkdown(this DataFrame df) => ToStringArray2D(df).ToMarkdown();

    public static void PrettyPrint(this DataFrameRow row) => Console.WriteLine(Pretty(row));
    public static string Pretty(this DataFrameRow row) => row.Select(x => x?.ToString() ?? string.Empty).StringJoin();
    private static string StringJoin(this IEnumerable<string> strings) => string.Join(" ", strings.Select(x => x.ToString()));

    private static string[,] ToStringArray2D(DataFrame df)
    {
        string[,] strings = new string[df.Rows.Count + 1, df.Columns.Count];

        for (int i = 0; i < df.Columns.Count; i++)
            strings[0, i] = df.Columns[i].Name;

        for (int i = 0; i < df.Rows.Count; i++)
            for (int j = 0; j < df.Columns.Count; j++)
                strings[i + 1, j] = df[i, j]?.ToString() ?? string.Empty;

        return strings;
    }

    private static int[] GetMaxLengthsByColumn(this string[,] strings)
    {
        int[] maxLengthsByColumn = new int[strings.GetLength(1)];

        for (int y = 0; y < strings.GetLength(0); y++)
            for (int x = 0; x < strings.GetLength(1); x++)
                maxLengthsByColumn[x] = Math.Max(maxLengthsByColumn[x], strings[y, x].Length);

        return maxLengthsByColumn;
    }

    private static string ToFormattedText(this string[,] strings)
    {
        StringBuilder sb = new();
        int[] maxLengthsByColumn = GetMaxLengthsByColumn(strings);

        for (int y = 0; y < strings.GetLength(0); y++)
        {
            for (int x = 0; x < strings.GetLength(1); x++)
            {
                sb.Append(strings[y, x].PadRight(maxLengthsByColumn[x] + 2));
            }
            sb.AppendLine();
        }

        return sb.ToString();
    }


    private static string ToMarkdown(this string[,] strings)
    {
        StringBuilder sb = new();
        int[] maxLengthsByColumn = GetMaxLengthsByColumn(strings);

        for (int y = 0; y < strings.GetLength(0); y++)
        {
            for (int x = 0; x < strings.GetLength(1); x++)
            {
                sb.Append(strings[y, x].PadRight(maxLengthsByColumn[x]));
                if (x < strings.GetLength(1) - 1)
                    sb.Append(" | ");
            }
            sb.AppendLine();

            if (y == 0)
            {
                for (int i = 0; i < strings.GetLength(1); i++)
                {
                    int bars = maxLengthsByColumn[i] + 2;
                    if (i == 0)
                        bars -= 1;
                    sb.Append(new String('-', bars));

                    if (i < strings.GetLength(1) - 1)
                        sb.Append("|");
                }
                sb.AppendLine();
            }
        }

        return sb.ToString();
    }
}
Name       Age  Height
Oliver     23   1.91
Charlotte  19   1.62
Henry      42   1.72
Amelia     64   1.57
Owen       35   1.85

I can create similar methods to format a DataFrame as Markdown or HTML.

Name      | Age | Height
----------|-----|--------
Oliver    | 23  | 1.91
Charlotte | 19  | 1.62
Henry     | 42  | 1.72
Amelia    | 64  | 1.57
Owen      | 35  | 1.85
Name Age Height
Oliver 23 1.91
Charlotte 19 1.62
Henry 42 1.72
Amelia 64 1.57
Owen 35 1.85

Using DataFrames in Interactive Notebooks

To get started using .NET workbooks, install the .NET Interactive Notebooks extension for VS Code, create a new demo.ipynb file, then add your code.

Previously users had to create custom HTML formatters to properly display DataFrames in .NET Interactive Notebooks, but these days it works right out of the box.

💡 See demo.html for a full length demonstration notebook

// visualize the DataFrame
df

Append a Row

Build a new row using key/value pair then append it to the DataFrame

List<KeyValuePair<string, object>> newRowData = new()
{
    new KeyValuePair<string, object>("Name", "Scott"),
    new KeyValuePair<string, object>("Age", 36),
    new KeyValuePair<string, object>("Height", 1.65),
};

df.Append(newRowData, inPlace: true);

Add a Column

Build a new column, populate it with data, and add it to the DataFrame

int[] weights = { 123, 321, 111, 121, 131 };
PrimitiveDataFrameColumn<int> weightCol = new("Weight", weights);
df.Columns.Add(weightCol);

Sort and Filter

The DataFrame class has numerous operations available to sort, filter, and analyze data in many different ways. A popular pattern when working with DataFrames is to use method chaining to combine numerous operations together into a single statement. See the DataFrame Class API for a full list of available operations.

df.OrderBy("Name")
    .Filter(df["Age"].ElementwiseGreaterThan(30))
    .PrettyPrint();
Name    Age  Height
Henry   42   1.72
Oliver  23   1.91
Owen    35   1.85

Mathematical Operations

It’s easy to perform math on columns or across multiple DataFrames. In this example we will perform math using two columns and create a new column to hold the output.

DataFrameColumn iqCol = df["Age"] * df["Height"] * 1.5;

double[] iqs = Enumerable.Range(0, (int)iqCol.Length)
    .Select(x => (double)iqCol[x])
    .ToArray();

df.Columns.Add(new PrimitiveDataFrameColumn<double>("IQ", iqs));
df.PrettyPrint();
Name       Age  Height  IQ
Oliver     23   1.91    65.9
Charlotte  19   1.62    46.17
Henry      42   1.72    108.36
Amelia     64   1.57    150.72
Owen       35   1.85    97.12

Statistical Operations

You can iterate across every row of a column to calculate population statistics

foreach (DataFrameColumn col in df.Columns.Skip(1))
{
    // warning: additional care must be taken for datasets which contain null
    double[] values = Enumerable.Range(0, (int)col.Length).Select(x => Convert.ToDouble(col[x])).ToArray();
    (double mean, double std) = MeanAndStd(values);
    Console.WriteLine($"{col.Name} = {mean} +/- {std:N3} (n={values.Length})");
}
Age = 36.6 +/- 15.982 (n=5)
Height = 1.734 +/- 0.130 (n=5)
💡 View the full MeanAndStd() source code
private static (double mean, double std) MeanAndStd(double[] values)
{
	if (values is null)
		throw new ArgumentNullException(nameof(values));

	if (values.Length == 0)
		throw new ArgumentException($"{nameof(values)} must not be empty");

	double sum = 0;
	for (int i = 0; i < values.Length; i++)
		sum += values[i];

	double mean = sum / values.Length;

	double sumVariancesSquared = 0;
	for (int i = 0; i < values.Length; i++)
	{
		double pointVariance = Math.Abs(mean - values[i]);
		double pointVarianceSquared = Math.Pow(pointVariance, 2);
		sumVariancesSquared += pointVarianceSquared;
	}

	double meanVarianceSquared = sumVariancesSquared / values.Length;
	double std = Math.Sqrt(meanVarianceSquared);

	return (mean, std);
}

Plot Values from a DataFrame

I use ScottPlot.NET to visualize data from DataFrames in .NET applications and .NET Interactive Notebooks. ScottPlot can generate a variety of plot types and has many options for customization. See the ScottPlot Cookbook for examples and API documentation.

// Register a custom formatter to display ScottPlot plots as images
using Microsoft.DotNet.Interactive.Formatting;
Formatter.Register(typeof(ScottPlot.Plot), (plt, writer) => 
    writer.Write(((ScottPlot.Plot)plt).GetImageHTML()), HtmlFormatter.MimeType);
// Get the data you wish to display in double arrays
double[] ages = Enumerable.Range(0, (int)df.Rows.Count).Select(x => Convert.ToDouble(df["Age"][x])).ToArray();
double[] heights = Enumerable.Range(0, (int)df.Rows.Count).Select(x => Convert.ToDouble(df["Height"][x])).ToArray();
// Create and display a plot
var plt = new ScottPlot.Plot(400, 300);
plt.AddScatter(ages, heights);
plt.XLabel("Age");
plt.YLabel("Height");
plt

💡 See demo.html for a full length demonstration notebook

If you are only working inside a Notebook and you want all your plots to be HTML and JavaScript, XPlot.Plotly is a good tool to use.

Data may contain null

I didn’t demonstrate it in the code examples above, but note that all column data types are nullable. While null-containing data requires extra considerations when writing mathematical routes, it’s a convenient way to model missing data which is a common occurrence in the real world.

Why not just use LINQ?

I see this question asked frequently, often with an aggressive and condescending tone. LINQ (Language-Integrated Query) is fantastic for performing logical operations on simple collections of data. When you have large 2D datasets of labeled data, advantages of data frames over flat LINQ statements start to become apparent. It is also easy to perform logical operations across multiple data frames, allowing users to write simpler and more readable code than could be achieved with LINQ statements. Data frames also make it much easier to visualize complex data too. In the data science world where complex labeled datasets are routinely compared, manipulated, merged, and visualized, often in an interactive context, the data frames are much easier to work with than raw LINQ statements.

Conclusions

Although I typically reach for Python to perform exploratory data science, it’s good to know that C# has a DataFrame available and that it can be used to inspect and manipulate tabular data. DataFrames pair well with ScottPlot figures in interactive notebooks and are a great way to inspect and communicate complex data. I look forward to watching Microsoft’s Data Analysis namespace continue to evolve as part of their machine learning / ML.NET platform.

Resources