The personal website of Scott W Harden

Speaking Numbers with a Microcontroller

How to encode WAV files into C code that can be replayed from memory

This page describes the technique I used to get a microcontroller to speak numbers out loud. Reading numbers aloud through a speaker is an interesting and simple alternative to showing them on a display, which often requires complex multiplexing circuitry and/or complex driver software. This page describes the techniques I used to extract audio waveforms from MP3 files and encode them into data that can be stored in the microcontroller’s flash memory.

Because microcontrollers have a limited amount of flash memory, this method is not suitable for long recordings, but it is fine for storing a few seconds of audio at a limited sample rate. Unlike more common methods for playing audio with a microcontroller, playing audio from program memory does not require an SD card, special hardware, or complex audio decoding software. Although this technique works best when the speaker is driven with an amplifier circuit, I found acceptable audio can be produced by driving a speaker directly from a microcontroller pin. This makes it possible to play surprisingly good audio without requiring any components other than a speaker.

NumberSpeaker Arduino Library

The code described on this page has been packaged into the NumberSpeaker library, which can be installed from the Arduino library manager. Users who just wish to have numbers read out loud can use this library without hassling with the more complex techniques described later on this page. Source code is available on GitHub, but the following sketch demonstrates basic usage:

#include "NumberSpeaker.h"

NumberSpeaker numberSpeaker = NumberSpeaker();

void setup() {
  numberSpeaker.begin();  // speaker on pin 11
}

void loop() {
  unsigned int count = 0;
  for (;;) {
    numberSpeaker.speak_int(count++);  // method name assumed; see the library's examples for the exact API
    delay(1000);
  }
}

Theory of Operation

Encoding the Audio

After some trial and error I found that 5 kHz 8-bit data is a good balance of quality vs. size for storing a human voice in program memory. The 5 kHz sample rate has a 2.5 kHz Nyquist frequency, meaning it can reproduce audio from 0-2.5 kHz, which I found acceptable for voice. For music I found it helps to increase the sample rate to 8 kHz. To reduce aliasing artifacts that may result from such an aggressive reduction in sample rate, it is useful to apply a low-pass filter to the audio before resampling it. Resampling should also be performed using interpolation, allowing the smooth reduction of sample rate by an arbitrary scale factor.

Full source code is available on the NumberSpeaker GitHub project.

Here’s a slimmed-down version of the Python code I used to perform these operations:

import librosa
import numpy
import scipy.signal

# read audio values from file
ys, sample_rate = librosa.load("zero.mp3")

# band-pass filter (the high-pass removes rumble, the low-pass prevents aliasing)
new_sample_rate = 5_000
freq_low = 150
freq_high = new_sample_rate / 2
sos = scipy.signal.butter(6, [freq_low, freq_high], 'bandpass', fs=sample_rate, output='sos')
ys = scipy.signal.sosfilt(sos, ys)

# resample with interpolation
sample_count = int(len(ys) * new_sample_rate / sample_rate)
xs_new = numpy.arange(sample_count) / new_sample_rate
xs = numpy.arange(len(ys)) / sample_rate
ys = numpy.interp(xs_new, xs, ys)

# scale to [0, 255] and quantize
ys = ys / max(abs(min(ys)), abs(max(ys)))
ys = ys / 2 + .5
ys = ys * 255
ys = ys.astype(numpy.uint8)  # one byte per 8-bit sample

I then used Python to loop across a folder of MP3 files and generate a C header file containing the waveform for each recording stored as a byte array:

const uint8_t AUDIO_1[] PROGMEM = { 129, 125, ..., 130, 132 };
const uint8_t AUDIO_2[] PROGMEM = { 127, 123, ..., 134, 130 };
const uint8_t AUDIO_3[] PROGMEM = { 122, 124, ..., 137, 135 };
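
The header-generation step can be sketched in Python. This is an illustrative sketch (the function and dictionary names are mine, not from the NumberSpeaker project), assuming each recording has already been quantized to a list of 8-bit values:

```python
# Convert quantized 8-bit sample lists into a C header containing
# one PROGMEM byte array per recording (names here are illustrative).
def make_header(waveforms: dict[str, list[int]]) -> str:
    lines = ["#include <avr/pgmspace.h>", ""]
    for name, samples in waveforms.items():
        values = ", ".join(str(v) for v in samples)
        lines.append(f"const uint8_t AUDIO_{name}[] PROGMEM = {{ {values} }};")
    return "\n".join(lines)

header = make_header({"1": [129, 125, 130, 132], "2": [127, 123, 134, 130]})
print(header)
```

Writing the result to a single .h file lets the microcontroller project include every waveform with one #include directive.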

Waveforms are stored in program memory using the PROGMEM keyword and later retrieved using the pgm_read_byte() function. See AVR-GCC’s Program Space Utilities documentation for additional information.

Playing the Audio

An 8-bit timer is used to generate the PWM signal that creates the audio waveform. I used Timer2 to generate this signal because Arduino uses Timer0 for its own tasks. I also set up the Timer2 registers myself to ensure it would run at maximum speed (ideal for generating waveforms using a simple R/C low-pass filter). Running from the 16 MHz system clock with no prescaler, the 8-bit timer overflows at a rate of 62.5 kHz.

// Use Timer2 to generate an analog voltage using PWM on pin 11
pinMode(11, OUTPUT);

// Fast PWM mode: set OC2A on BOTTOM, clear OC2A on compare match
TCCR2A = bit(COM2A1) | bit(WGM21) | bit(WGM20);

// Clock at CPU rate without prescaling
TCCR2B = bit(CS20);

// Set the initial level to half the positive rail voltage
OCR2A = 127;

Playback is achieved by setting PWM duty from the stored audio data. Playback speed can be customized by adjusting the delay between sample advancements.

for (int i = 0; i < sizeof(AUDIO_1); i++) {
    uint8_t value = pgm_read_byte(&AUDIO_1[i]);
    OCR2A = value; // update the duty of the 8-bit timer generating the PWM signal
    delayMicroseconds(200); // approximate delay for a 5 kHz sample rate
}
Memory Management

Using the strategies above I was able to encode numbers 0-9 and the word “point” using a total of 16,720 bytes (about half of the 32 KB program memory). I could save space by speeding up the recordings (reading the numbers faster), but I found the present settings to be a good balance between comfort and memory consumption.
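
As a quick check of these numbers, at 5 kHz with one byte per sample the total stored audio duration can be computed directly:

```python
# 8-bit audio stores one byte per sample, so duration = bytes / sample rate
sample_rate = 5_000        # samples per second
total_bytes = 16_720       # numbers 0-9 plus the word "point"
seconds = total_bytes / sample_rate
print(round(seconds, 2))   # about 3.3 seconds of total audio
```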

I experimented with strategies that used 4 bits instead of 8 to store each sample, but I found them unacceptably noisy. It may be possible to use 4-bit values to store the difference between each sample and the next, effectively halving the memory required to store these waveforms. There are many signal compression algorithms we could employ to reduce the memory footprint, but for now the present strategy works satisfactorily and has the benefit of minimal code complexity.
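
A minimal sketch of that delta idea, assuming each 8-bit sample differs from its neighbor by an amount small enough to clamp into a signed 4-bit value:

```python
def delta_encode(samples):
    """Encode samples as a first value plus clamped signed 4-bit deltas."""
    deltas = []
    prev = samples[0]
    for s in samples[1:]:
        d = max(-8, min(7, s - prev))  # clamp to the signed 4-bit range
        deltas.append(d)
        prev += d  # track the reconstructed value so clamping errors don't accumulate silently
    return samples[0], deltas

def delta_decode(first, deltas):
    """Rebuild the waveform by accumulating deltas onto the first sample."""
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

first, deltas = delta_encode([127, 130, 133, 131, 128])
print(delta_decode(first, deltas))  # [127, 130, 133, 131, 128]
```

Two deltas could then be packed per byte, which is where the halving of memory would come from.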

External memory could be used to store more audio. The Arduino (ATMega328P) has 32 KB of program memory which must be shared with the main program code, placing a restrictive upper limit on the total amount of audio data that may be stored on the chip itself. Users demanding more storage may benefit from interfacing an SPI flash memory IC. For example, the W25Q32JV has 4 MB of flash memory and is available on Mouser for $0.76 each. At 5 kHz, a 4 MB flash memory chip could store over thirteen minutes of 8-bit audio. There’s even a SPIMemory library for Arduino. However, some thought must be given to how to program the flash memory with the audio waveform from your computer.
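
The thirteen-minute figure follows directly from the sample rate:

```python
# one byte per sample at 5 kHz, so duration = capacity / sample rate
flash_bytes = 4 * 1024 * 1024  # 4 MB W25Q32JV
sample_rate = 5_000
minutes = flash_bytes / sample_rate / 60
print(round(minutes, 1))       # about 14 minutes of 8-bit audio
```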

Audio Amplification

Driving headphones or a speaker directly with a PWM output pin works, but amplifying the signal substantially improves the quality of the sound. The LM386 single-chip audio amplifier is technically obsolete (discontinued in 2016), but it is commonly used in hobby circles because clones are ubiquitously available in DIP packages, which makes them convenient for breadboarding. The LM4862 is a better choice for serious designs, as it is currently in production and does not require large electrolytic capacitors. I’m using the old LM386 here with a minimum of components and the default 20x gain obtained when pins 1 and 8 are left floating.

This is the audio amplifier circuit I ended up using for this project. The combination of a 10 kΩ resistor and 0.1 µF capacitor forms a low-pass filter with a -3 dB cutoff of about 160 Hz. This seems extremely aggressive considering the 8-bit PWM frequency is 62.5 kHz, but I found this setup worked pretty well on my bench.
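
The cutoff frequency follows from the standard first-order RC formula, f = 1 / (2πRC):

```python
import math

R = 10_000   # ohms
C = 0.1e-6   # farads
f_cutoff = 1 / (2 * math.pi * R * C)
print(round(f_cutoff))  # 159 (Hz), i.e. the "about 160 Hz" cutoff above
```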

Arduino Demo

Here you can observe the digital PWM waveform (yellow) next to the low-pass filtered analog signal (blue). I acknowledge that the oscilloscope shows high-frequency noise all over the place, but for the present application it doesn’t really matter.

Playing Audio from Newer AVR Chips

Let’s leave Arduino behind and switch to one of the more modern AVR microcontroller chips. These newer 8-bit AVR microcontrollers offer a superior set of peripherals at lower cost and with greater availability. Unlike older AVR chips, these new AVR microcontrollers cannot be flashed using ICSP programmers. See my Programming Modern AVR Microcontrollers page to learn how to use inexpensive gear to program the latest family of AVR chips with a UPDI programmer.

The AVR64DD32 microcontroller is one of the highest-end 8-bit AVRs currently on the market. It has 64 KB of flash memory, and this increased size allowed me to experiment with storing longer recordings at a higher sample rate. It’s worth noting that the DA and DB families of chips have products with 128 KB of flash memory. Array lengths are limited to 32 KB, but I was able to circumvent this restriction by storing audio across multiple arrays and adding extra logic to ensure the correct source array is sampled.

These chips do not come in DIP packages, but QFP/DIP breakout boards make them easy to experiment with in a breadboard.

AVR64 DD Code

Two timers can be configured to achieve asynchronous audio playback: one generates the PWM waveform, and the other fires interrupts that advance the waveform at the audio sample rate, so playback does not block main program execution.

Use the 24 MHz internal clock for the fastest PWM possible:

#define F_CPU 24000000UL

// Protected write: run the high-frequency internal oscillator at 24 MHz
// (register and group names per the AVR DD datasheet and device headers)
CCP = CCP_IOREG_gc;
CLKCTRL.OSCHFCTRLA = CLKCTRL_FRQSEL_24M_gc;

Setup the 8-bit TimerB to produce the PWM signal:

// Enable PWM output (PA2 is the default TCB0 WO pin;
// don't forget to set the PORT direction)
PORTA.DIRSET = PIN2_bm;

// Make the waveform output available on the pin
TCB0.CTRLB |= TCB_CCMPEN_bm;

// Enable 8-bit PWM mode
TCB0.CTRLB |= TCB_CNTMODE_PWM8_gc;

// Set period and duty
TCB0.CCMPL = 255; // top value
TCB0.CCMPH = 50;  // flip value

// Enable the timer
TCB0.CTRLA = TCB_ENABLE_bm;

Setup the 16-bit TimerA to advance the waveform at 5 kHz:

// enable Timer A
TCA0.SINGLE.CTRLA = TCA_SINGLE_ENABLE_bm;

// Overflow triggers an interrupt
TCA0.SINGLE.INTCTRL = TCA_SINGLE_OVF_bm;

// Set the period for the audio sample rate
TCA0.SINGLE.PER = F_CPU/5000; // 5 kHz audio

Populate the overflow event code:

ISR(TCA0_OVF_vect) {
    // read the next level from program memory
    uint8_t level = pgm_read_byte(&AUDIO_SAMPLES[AUDIO_INDEX++]);

    // wait until the next rollover to reduce static
    while (TCB0.CNT > 0) {}

    // update the PWM level
    TCB0.CCMPH = level;

    // roll over the index to loop the audio
    if (AUDIO_INDEX >= sizeof(AUDIO_SAMPLES))
        AUDIO_INDEX = 0;

    // indicate the interrupt was handled
    TCA0.SINGLE.INTFLAGS = TCA_SINGLE_OVF_bm;
}

Enable global interrupts:

sei();

Full source code is on GitHub.

Use the AVR’s DAC for Audio Playback

Modern 8-bit AVRs have a 10-bit digital-to-analog converter (DAC) built in. It is simpler to set up and use than a discrete timer/counter in PWM mode. Although the code above uses the AVR’s timer/counter B (TCB) to generate the analog waveform, the DAC is recommended when it is available:

// Enable the DAC and output the voltage on its pin
DAC0.CTRLA = DAC_ENABLE_bm | DAC_OUTEN_bm;

// Set the DAC level
uint8_t level = 123; // retrieved from memory
DAC0.DATA = level << 8; // shift the 8-bit level into the highest bits of the 10-bit DAC

Playing Music from an AVR DD Series Microcontroller

YouTube has a surprisingly large number of videos of people beeping the Coffin Dance song from an Arduino, so here’s my video response playing the actual song audio encoded as an 8 kHz 8-bit waveform on an AVR64DD32 microcontroller. The original song is Astronomia by Vicetone and Tony Igy. Refer to Know Your Meme: Coffin Dance for more information about internet culture.

Additional Resources

Summarizing Blog Posts with AI

How I used Python to run the Llama 2 large language model locally and convert my rambling childhood blog posts into concise summaries

This page describes how I used a locally hosted generative AI to read blog posts I made as a child and summarize them using more professional language. In the early 2000s there was no Twitter, and the equivalent was making short blog posts every few days. Today this website is effectively a technology blog, and although I’m proud to have been blogging for over twenty years, I’m embarrassed enough by some of the content I wrote as a child that I have hundreds of posts hidden. I like the idea of respecting my past by keeping those old posts up in some form, and I hope that generative AI can help me do that by rewriting them in language more consistent with the present tone of this website.


This article builds upon my previous two articles:

- Run Llama 2 Locally with Python
- Using Llama 2 to Answer Questions About Local Documents

Review the information on those pages for details about setting up the Python environment required to use large language models and generative AI locally to answer questions about information contained in documents on the local filesystem.


The following code uses LangChain to provide the tools for interpretation and generative AI. Specifically, the all-MiniLM-L6-v2 sentence transformer is used to interpret content of the old blog posts and store it in a FAISS vector database, then the Llama 2 large language model is used to answer questions about its content. I am thankful to have kept all of my original blog posts (translated into Markdown format), so processing them individually is not too complicated:

"""
This script uses AI to interpret old blog posts and generate
new ones containing one-sentence summaries of their content.
"""

import pathlib
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain import PromptTemplate
from langchain.chains import RetrievalQA

def analyze(file: pathlib.Path) -> FAISS:
    """Interpret an old blog post in Markdown format
    and return its content as a vector database."""
    loader = UnstructuredMarkdownLoader(file)
    data = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                              chunk_overlap=50)
    texts = splitter.split_documents(data)
    embeddings = HuggingFaceEmbeddings(
        model_name='sentence-transformers/all-MiniLM-L6-v2',
        model_kwargs={'device': 'cpu'})
    return FAISS.from_documents(texts, embeddings)

def build_llm(db: FAISS) -> RetrievalQA:
    """Create an AI chat large language model prepared
    to answer questions about content of the database."""

    template = """
    Use information from the following blog post to answer questions about its content.
    If you do not know the answer, say that you do not know, and do not try to make up an answer.
    Context: {context}
    Question: {question}
    Only return the helpful answer below and nothing else.
    Helpful answer:
    """

    llm = CTransformers(model='../models/llama-2-7b-chat.ggmlv3.q8_0.bin',
                        model_type='llama',
                        config={'max_new_tokens': 256, 'temperature': 0.01})
    retriever = db.as_retriever(search_kwargs={'k': 2})
    prompt = PromptTemplate(
        template=template,
        input_variables=['context', 'question'])
    return RetrievalQA.from_chain_type(llm=llm,
                                       chain_type='stuff',
                                       retriever=retriever,
                                       chain_type_kwargs={'prompt': prompt})

def summarize_article(file: pathlib.Path):
    """Summarize the content of an old markdown file and
    save the result in a new markdown file."""
    output_file = pathlib.Path("summaries").joinpath(file.name)
    db = analyze(file)
    qa_llm = build_llm(db)
    output = qa_llm(
        {'query': "What is a one sentence summary of this blog post?"})
    summary = str(output["result"]).strip()
    with open(output_file, 'w') as f:
        f.write(summary)

if __name__ == "__main__":
    files = sorted(list(pathlib.Path("posts").glob("*.md")))
    for i, file in enumerate(files):
        print(f"summarizing {i+1} of {len(files)}")
        summarize_article(file)


Summarizing 355 blog posts took a little over 6 hours, averaging about one minute each. Interestingly, longer articles did not take significantly longer to summarize. The following chart shows the time it took for the Q8 model to summarize each of the 355 old blog posts.

Note that I used ScottPlot.NET to create this chart using C#.

// Read data about each post from the log file
string[] lines = File.ReadAllLines("log.tsv");
List<double> logWords = new();
List<double> times = new();
foreach (string line in lines)
{
    string[] strings = line.Split('\t');
    if (strings.Length != 6)
        continue;
    // column indices are assumptions; adjust to match the log file layout
    logWords.Add(Math.Log10(double.Parse(strings[4])));
    times.Add(double.Parse(strings[5]));
}

// Plot article length vs processing time
ScottPlot.Plot plot = new(600, 400);
plot.AddScatterPoints(logWords.ToArray(), times.ToArray());

// display the log-scaled word counts using linear tick labels
static string FormatTick(double value) => Math.Pow(10, value).ToString();
plot.XAxis.TickLabelFormat(FormatTick);

plot.SetAxisLimits(1, 4, 25, 100);
plot.Title("AI Blog Post Summary Performance");
plot.XLabel("Article Length (Number of Words)");
plot.YLabel("Processing Time (Seconds)");
plot.SaveFig("performance.png");

Comparing Summaries by Model

I found it interesting to compare summaries generated by the Q8 model against the smaller Q2_K model. Model descriptions can be found on the Hugging Face readme page.

Metric                      | Larger Model     | Smaller Model
Specific Model (Llama 2 7B) | chat.ggmlv3.q8_0 | chat.ggmlv3.q2_K
Model Size on Disk          | 7.0 GB           | 2.8 GB
Mean Response Time          | 62.4 ± 1.8 sec   | 25.4 ± 0.8 sec

The smaller model ran in less than half the time and typically produced adequate summaries, but the larger model often produced summaries that were slightly better and had a little more personality. For this reason I used the larger model to generate all the summaries I ended up keeping.

Larger Model: The author has lost all their hard work and progress on their websites due to technical issues, and they are unsure if they will try to recreate them or keep this page as a reminder of their past efforts.
Smaller Model: The author has lost all of their work on their websites, including thousands of lines of code, and is unlikely to try to recreate them. Instead, they will keep this page as a reminder of the time and effort that went into creating their websites.

Larger Model: The author is unsure of what they will do with their website in the future, but they plan to continue adding to it sporadically as a way to keep it updated and active.
Smaller Model: The blog post is about the author’s plans for their website, including creating a new page or section for it, and using it as an almost blog style format to share information with readers.

Larger Model: The author is trying to create a bootable CD with the original Quake game that supports soundcard and mouse functionality.
Smaller Model: The author of the blog post is trying to create a bootable CD with the original Quake game on it that supports soundcards and mice, so that they can play the game at LAN parties without having to worry about setting up the operating system or dealing with any issues related to the game.

Larger Model: The writer is experiencing hostility from their parents due to their online writing, specifically their weblog, and is finding it difficult to write normally as a result.
Smaller Model: The writer is upset that their parents are reading and misinterpreting their blog posts, leading to incorrect assumptions about their intentions.

Larger Model: The writer is worried about something someone said to them and is trying to figure it out by writing it down and hoping to understand it better by morning.
Smaller Model: The author is worried about something someone said to them and is trying to figure it out by going to sleep and thinking about it overnight.

Larger Model: The author of the blog post is claiming that someone stole pictures from their website and passed them off as their own, and they got revenge by creating fake pictures of the thief’s room and linking to them from their site, causing the thief to use a lot of bandwidth.
Smaller Model: The blog post is about someone who found pictures of their room on another person’s website, claimed they were their own, and linked to them from their site without permission.

About 1 in 50 summaries were totally off base. One of them claimed my blog article described the meaning of a particular music video, and after watching it on YouTube I’m confident that I had never seen the video or heard the song.

About 1 in 100 summaries lacked sufficient depth for me to be satisfied with the result (because those posts were very long and/or very important to me), so I will be doing a paragraph-by-paragraph summarization and manually editing the result to ensure the spirit of each article is sufficiently captured.

Overall, this was an impressive amount of work to achieve in a single day: the source material contained almost half a million words in total, and I was able to adequately review all of that content.

Amusing Findings From the Past

In the process of reviewing the AI-generated blog post summaries, I found a few things that made me chuckle:

November 25, 2003 (Summary): The author of the blog post is discussing how they were unable to find a good way to summarize their own blog posts, despite trying various methods.

Funny how that one came around full circle!

March 21, 2005 (Summary): I can’t believe I’m making so many audioblogs! Download the latest one on the Audioblogs page (a link is on the right side of the page).

The term “Audioblogs” was used before the word “podcast” was popular. I remember recording at least a few episodes, and I’m really disappointed they weren’t preserved with the rest of my blog posts. I also found references to “Videoblogs” or “Vlogs”, and I remember recording at least two episodes of that, but I think those too have been lost to history.

June 4, 2005 (Summary): The author of the blog uses WordPress as their blogging engine instead of MovableType, which they previously used but grew to hate due to speed issues.

That one came full circle too! My blog started as a static site, went to a database-driven site, then worked its way all the way back to static where it is today. My best guess at the dates of the various technologies is:


Completing this project allowed me to add 325 previously hidden blog posts back onto my website, bringing the total number of posts from 295 to 620! It took several hours for the AI tools to summarize almost half a million words of old blog posts, then a few hours of my own to spot-check the summaries and tweak them where needed, but overall I’m very satisfied by how I was able to wield generative AI to assist the transformation of such a large amount of text in a single day.

Using Llama 2 to Answer Questions About Local Documents

How to use open source large language models to ingest information from local files and generate AI responses to questions about their content

This page describes how I use Python to ingest information from documents on my filesystem and run the Llama 2 large language model (LLM) locally to answer questions about their content. My ultimate goal with this work is to evaluate feasibility of developing an automated system to digest software documentation and serve AI-generated answers to technical questions based on the latest available information.

Quickstart: The previous post Run Llama 2 Locally with Python describes a simpler strategy for running Llama 2 locally if your goal is to generate AI chat responses to text prompts without ingesting content from local documents.

Environment Setup

Download a Llama 2 model in GGML Format

The model I’m using (7B Q8 0) is good but heavy, requiring 7 GB of disk space and 10 GB of RAM. Lighter and faster models are available. I recommend trying different models; if your application can use a simpler model, it may result in significantly improved performance.

Create a Virtual Environment

Install Required Packages

Altogether we will need the langchain, sentence-transformers, faiss-cpu, and ctransformers packages:

pip install langchain sentence_transformers faiss-cpu ctransformers

Installing these packages requires a C++ compiler. To get one:

Prepare Documents for Ingestion

I created a few plain text files that contain information for testing. I was careful to be specific enough (and absurd enough) that we can be confident the AI chat is referencing this material when answering questions about it later.


Scott William Harden is an open-source software developer. He is the primary author of ScottPlot, pyabf, FftSharp, Spectrogram, and several other open-source packages. Scott’s favorite color is dark blue despite the fact that he is colorblind. Scott’s advocacy for those with color vision deficiency (CVD) leads him to recommend perceptually uniform color maps like Viridis instead of traditional color maps like Spectrum and Jet.


“JupyterGoBoom” is the name of a Python package for creating unmaintainable Jupyter notebooks. It is no longer actively developed and is now considered obsolete because modern software developers have come to realize that Jupyter notebooks grow to become unmaintainable all by themselves.

Interpreting Content of Local Files

We can use the MiniLM language model to interpret the content of our documents and save that information in a vector database so the AI chat has access to it. The interpreted information is saved to disk in the FAISS (Facebook AI Similarity Search) file format, a vector database optimized for similarity search across large, high-dimensional datasets.

"""This script creates a database of information gathered from local text files."""

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# define what documents to load
loader = DirectoryLoader("./", glob="*.txt", loader_cls=TextLoader)

# interpret information in the documents
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                          chunk_overlap=50)
texts = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    model_kwargs={'device': 'cpu'})

# create and save the local database
db = FAISS.from_documents(texts, embeddings)
db.save_local("faiss")

Although this example saves the vector database to disk so it can be loaded later by a large language model, you do not have to write the vector database to disk if you only intend to consume it in the same Python script. In that case you can omit db.save_local() and consume the db object directly instead of calling FAISS.load_local(), keeping the database in memory the whole time.

Users may desire to customize arguments of the various methods listed here to improve behavior in application-specific ways. For example, RecursiveCharacterTextSplitter() has optional keyword arguments for chunk_size and chunk_overlap which are often customized to best suit the type of content being ingested. See Chunking Strategies for LLM Applications, What Chunk Size and Chunk Overlap Should You Use? and the LangChain documentation for more information.
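
To build intuition for what these parameters do, here is a pure-Python sketch of fixed-size chunking with overlap. This is illustrative only; LangChain's splitter is smarter, preferring to break at separators like paragraphs and sentences:

```python
def chunk(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size characters, where each
    window overlaps the previous one by chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]

print(chunk("abcdefghij", chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap ensures that a fact spanning a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.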

LangChain has advanced tools available for ingesting information in complex file formats like PDF, Markdown, HTML, and JSON. Plain text files are used in this example to keep things simple, but more information is available in the official documentation.

Prepare an AI That is Aware of Local File Content

We can now prepare an AI Chat from a LLM pre-loaded with information contained in our documents and use it to answer questions about their content.

This script reads the database of information from local text files
and uses a large language model to answer questions about their content.

from langchain.llms import CTransformers
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import RetrievalQA

# prepare the template we will use when prompting the AI
template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful answer below and nothing else.
Helpful answer:
"""

# load the language model
llm = CTransformers(model='./llama-2-7b-chat.ggmlv3.q8_0.bin',
                    model_type='llama',
                    config={'max_new_tokens': 256, 'temperature': 0.01})

# load the interpreted information from the local database
embeddings = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    model_kwargs={'device': 'cpu'})
db = FAISS.load_local("faiss", embeddings)

# prepare a version of the llm pre-loaded with the local content
retriever = db.as_retriever(search_kwargs={'k': 2})
prompt = PromptTemplate(
    template=template,
    input_variables=['context', 'question'])
qa_llm = RetrievalQA.from_chain_type(llm=llm,
                                     chain_type='stuff',
                                     retriever=retriever,
                                     chain_type_kwargs={'prompt': prompt})

# ask the AI chat about information in our local files
prompt = "Who is the author of FftSharp? What is their favorite color?"
output = qa_llm({'query': prompt})
print(output["result"])


Here are the answers the script above gave me for the following questions:

Question: Who is the author of FftSharp? What is their favorite color?

Response: Scott William Harden is the author of FftSharp. According to Scott, his favorite color is dark blue despite being colorblind.

Question: Why is JupyterGoBoom obsolete?

Response: JupyterGoBoom is considered obsolete because modern software developers have come to realize that Jupyter notebooks become unmaintainable all by themselves.

Both answers are consistent with the information written in the ingested text documents. The AI is successfully answering questions about our custom content!

Jupyter Notebook

Here is a standalone Jupyter notebook that demonstrates how to ingest information from documents and interact with a large language model to have AI chat answer questions about their content. This notebook contains a few extra features to improve formatting of the output as well.

Ideas and Future Directions

The recent availability of open source large language models that can be run locally offers many interesting opportunities that were not feasible with cloud-only offerings like ChatGPT. For example, consider that your application involves large amounts of sensitive data. You can ingest all the documents and ask AI chat specific questions about their content all on a local machine. This permits development of AI tools that can safely interact with high security operations requiring intellectual property protection, handling of medical data, and other sensitive applications where information security is paramount.

As a maintainer of several open-source projects, I am curious to learn if this strategy may be useful for generating automated responses to questions posted in GitHub issues that have answers which already exist in the software documentation. It would be interesting to incorporate AI content ingestion into the CI/CD pipeline of my software projects so that each software release would be paired with an AI chat bot that provides answers to technical questions with version-specific guidance based upon the documentation generated during that same build workflow. I am also curious about the effectiveness of pairing ingested documentation with GitHub Issues and Discussions so that responses to user questions can point to recommended threads where human conversation may be useful.

There are many avenues to explore from here, and I’m thankful for all of the developers and engineering teams that provide these models and tools for anyone to experiment with.

Update: Google Took My Fake Training Data Seriously

The day after I wrote this article I was working on an unrelated project that resulted in me Googling myself. I was surprised to see that the silly training text above is now Google’s description of my domain. I hope it will change again when I add a new blog post and that text ends up lower down on the page. This is a good reminder that in the world of automated document evaluation and AI-assisted generation, any text you share online may be consumed by robots and fed to others in ways you may not be able to control.


Run Llama 2 Locally with Python

How to run open source large language models locally to provide AI chat responses to text prompts

This page describes how to interact with the Llama 2 large language model (LLM) locally using Python, without requiring internet, registration, or API keys. We will deliver prompts to the model and get AI-generated chat responses using the llama-cpp-python package.

Step 1: Download a Large Language Model

The Llama 2 model can be downloaded in GGML format from Hugging Face:

The model I’m using here is the largest and slowest one currently available. It is 7 GB in size and requires 10 GB of ram to run. Developers should experiment with different models, as simpler models may run faster and produce similar results for less complex tasks.

Step 2: Prepare the Python Environment

Installation builds native code, so it will fail if a C++ compiler cannot be located on your system. Ensure one is installed for your platform before proceeding.
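A minimal environment setup might look like the following. This is a sketch assuming Python 3 and pip are available; the `env` folder name is arbitrary, and the `llama-cpp-python` package name comes from the introduction above:

```shell
# create and activate a virtual environment
python -m venv env
source env/bin/activate   # on Windows: env\Scripts\activate

# install the Python bindings for llama.cpp
pip install llama-cpp-python
```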

Step 3: Interact with the Llama 2 large language model

Create a new Python script and run it inside the virtual environment:

# load the large language model file
from llama_cpp import Llama
LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q8_0.bin")

# create a text prompt
prompt = "Q: What are the names of the days of the week? A:"

# generate a response (takes several seconds)
output = LLM(prompt)

# display the response
print(output["choices"][0]["text"])

It took my system 7 seconds to generate this response:

Question: What are the names of the days of the week?

Answer: The names of the days of the week, in order, are: Monday Tuesday Wednesday Thursday Friday Saturday Sunday

Generating Longer Responses

To get longer responses to complex questions, set n_ctx to a larger number to increase the number of tokens the large language model uses for context, then remove the max_tokens limit for the query. While this allows longer responses, it can significantly increase the total time required to generate a response.

from llama_cpp import Llama

# define n_ctx manually to permit larger contexts
LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q8_0.bin", n_ctx=2048)

# create a text prompt
prompt = "Q: Why are Jupyter notebooks difficult to maintain? A:"

# set max_tokens to 0 to remove the response size limit
output = LLM(prompt, max_tokens=0)

# display the response
print(output["choices"][0]["text"])

It took my system 1.5 minutes to generate this response:

Question: Why are Jupyter notebooks difficult to maintain?


Jupyter notebooks can be challenging to maintain for several reasons:

  1. Temporary file system: Jupyter notebooks reside on the user’s local disk, which can make them unreliable and difficult to maintain over time.

  2. Version control issues: As notebooks are updated frequently, it becomes challenging to keep track of changes and manage different versions of a notebook. This can lead to conflicts when working with others on the same project.

  3. Lack of organization: Since Jupyter notebooks are created manually, it’s easy to lose track of which files are included in the notebook and how they relate to each other. This makes it challenging to maintain a well-organized structure for large projects with many components.

  4. Limited automation: There are limited tools available for automating tasks within Jupyter notebooks, making it more challenging to create reproducible builds or automate the testing process.

  5. Security concerns: Since Jupyter notebooks contain sensitive data such as user credentials, personal information, and intellectual property, they must be protected against unauthorized access and breaches. This can further complicate maintenance efforts.

  6. Interactive interfaces: Jupyter notebooks are designed for interactive use, which means that any changes to the code or visualizations will affect the interface in real-time. This can make it challenging to maintain a consistent appearance while working on a project.

  7. Limited collaboration support: Jupyter notebooks are primarily designed as an individual tool, which limits their ability to accommodate multiple users and their contributions simultaneously. As a result, collaboration becomes more complicated with larger teams.

  8. Lack of integration with other tools: Jupyter notebooks do not integrate well with other popular development tools like Git or IDEs (Integrated Development Environments). This can make it difficult to manage multiple projects, keep track of changes, and collaborate effectively with team members.

  9. Limited scalability: As projects grow in size and complexity, Jupyter notebooks may become less efficient and more challenging to maintain. This is particularly true for large-scale data science or machine learning projects that require a lot of processing power and memory.

  10. Lack of formalized workflows: Since Jupyter notebooks are primarily designed as an interactive environment, they don’t provide standardized workflows or processes for managing projects. This can result in a less organized and more challenging maintenance experience for project teams.

By understanding these challenges, you can better prepare yourself to manage the complexities of Jupyter notebooks and develop strategies to address them effectively.

Using Llama 2 AI Chat in a Jupyter Notebook

Here is a standalone Jupyter notebook that demonstrates how to use different large language models to generate AI chat responses to plain text prompts. This notebook contains a few extra features to improve formatting of the output as well.
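As a sketch of the kind of formatting improvement such a notebook might include, the helper below word-wraps a chat response for readability. The `format_response` function and its 70-character width are my own illustration, not code from the notebook itself:

```python
import textwrap

def format_response(text: str, width: int = 70) -> str:
    """Word-wrap an AI chat response so long lines display nicely."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    wrapped = [textwrap.fill(p, width=width) for p in paragraphs]
    return "\n\n".join(wrapped)

# responses generated by llama-cpp-python (as in the scripts above)
# could be passed through the helper before display:
#   output = LLM(prompt)
#   print(format_response(output["choices"][0]["text"]))

print(format_response("The names of the days of the week, in order, are: "
                      "Monday Tuesday Wednesday Thursday Friday Saturday Sunday"))
```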

Answering Questions about Local Documents

The AI-generated responses above only contain information built into the model itself. What if we want to give our language model custom information then have it be able to answer questions about it?

My next post Using Llama 2 to Answer Questions About Local Documents explores how to have the AI interpret information from local documents so it can answer questions about their content using AI chat.


Static Site Broken Link Detection

How to check for broken links and images across large static websites

This website is a static site containing thousands of pages, and I recently had the desire to check all of them for broken links and images. Although it would be a tedious task to perform manually, I was able to automate the process using Python. This page describes the process I used to identify broken links and images and may be helpful to others using static site generator tools like Jekyll, Hugo, Eleventy, Pelican, Gatsby, other Jamstack site generators, or hand-written HTML files.

Although my website content is stored as markdown files, I found it most convenient to check the generated HTML files for broken link and image URLs. I generated my static site locally, then manually removed pagination folders so only article HTML files were present. I then wrote the following Python script, which uses pathlib to locate HTML files and BeautifulSoup to extract `a href` and `img src` attributes, then saves all identified URLs as a CSV file.

import pathlib
from bs4 import BeautifulSoup

def get_urls(file: pathlib.Path) -> list:
    """Return link and image URLs in an HTML file"""
    with open(file, errors='ignore') as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    urls = set()

    for link in soup.find_all('a'):
        href = link.get('href')
        if not href:
            continue
        if href.startswith("#"):
            continue
        if "#" in href:
            href = href.split("#")[0]
        href = str(href).strip('/')
        urls.add(href)

    for image in soup.find_all('img'):
        src = image.get('src')
        if src:
            urls.add(str(src).strip('/'))

    return sorted(list(urls))

def get_urls_by_page(folder: pathlib.Path) -> dict:
    """Return link and image URLs for all HTML files in a folder"""
    html_files = list(folder.rglob("*.html"))
    urls_by_page = {}
    for i, html_file in enumerate(html_files):
        urls = get_urls(html_file)
        urls_by_page[html_file] = urls
        print(f"{i+1} of {len(html_files)}: {len(urls)} URLs found in {html_file}")
    return urls_by_page

def write_csv(urls_by_page: dict, csv_file: pathlib.Path):
    """Save a CSV report of URLs and the pages they appear on"""
    txt = 'URL, Page\n'
    for page, urls in urls_by_page.items():
        for url in urls:
            txt += f'{url}, {page}\n'
    csv_file.write_text(txt)

if __name__ == "__main__":
    folder = pathlib.Path("public")
    urls_by_page = get_urls_by_page(folder)
    write_csv(urls_by_page, pathlib.Path("urls.csv"))

Running the script generated a CSV report showing every link on my website, organized by which page it’s on. It looks like my blog has over 7,000 links! That would be a lot to check by hand.
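Counting the unique URLs in the report takes only a few lines. The `count_unique_urls` helper below is my own illustration; it reads a CSV in the format written by the script above, skipping the header row:

```python
import pathlib

def count_unique_urls(csv_file: pathlib.Path) -> int:
    """Count unique URLs in the CSV report, skipping the header row."""
    lines = csv_file.read_text().split("\n")
    urls = {line.split(", ")[0] for line in lines
            if line.strip() and not line.startswith("URL")}
    return len(urls)
```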

Checking URLs

Each URL from the report was checked using HEAD requests. A HEAD request returns only the HTTP headers and does not actually download the file, allowing it to consume far less bandwidth than GET requests. Inspecting the HTTP response code indicates whether the URL is a valid path to a file (code 200).

I used Python sets to prevent checking the same URL twice. I logged good and broken URLs as they were checked, and consumed these log files at startup so the program could be stopped and restarted without re-checking the same URL.

import pathlib
import urllib.request
import time

def is_url_valid(url: str) -> bool:
    """Check if a URL exists without downloading its contents"""
    request = urllib.request.Request(url)
    request.get_method = lambda: 'HEAD'
    try:
        urllib.request.urlopen(request, timeout=5)
        return True
    except Exception as e:
        print(f"ERROR: {e}")
        return False

if __name__ == "__main__":

    # load URLs from report
    urls = [x.split(", ")[0] for x
            in pathlib.Path("urls.csv").read_text().split("\n")
            if x.startswith("http")]

    # load previously checked URLs
    url_file_good = pathlib.Path("urls-good.txt")
    url_file_bad = pathlib.Path("urls-bad.txt")
    url_file_good.touch()
    url_file_bad.touch()
    checked = set()
    for url in url_file_good.read_text().split("\n"):
        checked.add(url)
    for url in url_file_bad.read_text().split("\n"):
        checked.add(url)

    # check each URL
    for i, url in enumerate(urls):
        print(f"{i+1} of {len(urls)}", end=" ")
        if url in checked:
            print(f"SKIPPING {url}")
        print(f"CHECKING {url}")
        log_file = url_file_good if is_url_valid(url) else url_file_bad
        with open(log_file, 'a') as f:

I wrote a Python script to generate an HTML report of all the broken links on my website. The script shows every broken link found on the website and lists all the pages on which each broken link appears.

import pathlib

urls_and_pages = [x.split(", ") for x
                    in pathlib.Path("urls.csv").read_text().split("\n")
                    if x.startswith("http")]

urls_broken = [x for x
                in pathlib.Path("urls-bad.txt").read_text().split("\n")
                if x.startswith("http")]

urls_broken = set(urls_broken)

html = "<h1>Broken Links</h1>"
for url_broken in urls_broken:
    html += f"<div class='mt-4'><a href='{url_broken}'>{url_broken}</a></div>"
    pages = [x[1] for x in urls_and_pages if x[0] == url_broken]
    for page in pages:
        html += f"<div><code>{page}</code></div>"

I wrapped the output in Bootstrap to make the report look pretty and appear properly on mobile devices:

html = f"""<!doctype html>
<html lang="en">
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0-alpha3/dist/css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
<div class='container'>
{html}
</div>
</body>
</html>"""

pathlib.Path("report.html").write_text(html)  # save the report (filename arbitrary)


The generated report lets me focus my effort on tactically fixing the broken URLs I think are most important (e.g., image URLs, or URLs pointing at domain names).