This page describes how I used a locally hosted generative AI to read blog posts I made as a child and summarize them in more professional language. In the early 2000s there was no Twitter, and the equivalent was making short blog posts every few days. Today this website is effectively a technology blog, and although I’m proud to have been blogging for over twenty years, I’m embarrassed enough by some of the content I wrote as a child that I have hundreds of posts hidden. I like the idea of respecting my past by keeping those old posts up in some form, and I hope that generative AI can help me do that using language more consistent with the present tone of this website.
Review the information on those pages for details about setting up the Python environment required to use large language models and generative AI locally to answer questions about information contained in documents on the local filesystem.
The following code uses LangChain to provide the tools for interpretation and generative AI. Specifically, the all-MiniLM-L6-v2 sentence transformer is used to interpret the content of the old blog posts and store it in a FAISS vector database, then the Llama 2 large language model is used to answer questions about that content. I am thankful to have kept all of my original blog posts (translated into Markdown format), so processing them individually is not too complicated:
"""
This script uses AI to interpret old blog posts and generate
new ones containing one-sentence summaries of their content.
"""importpathlibfromlangchain.document_loadersimport DirectoryLoader, TextLoader, UnstructuredMarkdownLoader
fromlangchain.text_splitterimport RecursiveCharacterTextSplitter
fromlangchain.embeddingsimport HuggingFaceEmbeddings
fromlangchain.vectorstoresimport FAISS
fromlangchain.llmsimport CTransformers
fromlangchainimport PromptTemplate
fromlangchain.chainsimport RetrievalQA
defanalyze(file: pathlib.Path) -> FAISS:
"""
Interpret an old blog post in Markdown format
and return its content as a vector database.
""" loader = UnstructuredMarkdownLoader(file)
data = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
chunk_overlap=50)
texts = splitter.split_documents(data)
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2",
model_kwargs={'device': 'cpu'})
return FAISS.from_documents(texts, embeddings)
defbuild_llm(db: FAISS) -> RetrievalQA:
"""
Create an AI Chat large language model prepared
to answer questions about content of the database
""" template ="""
Use information from the following blog post to answer questions about its content.
If you do not know the answer, say that you do not know, and do not try to make up an answer.
Context: {context} Question: {question} Only return the helpful answer below and nothing else.
Helpful answer:
""" llm = CTransformers(model='../models/llama-2-7b-chat.ggmlv3.q8_0.bin',
model_type='llama',
config={'max_new_tokens': 256, 'temperature': 0.01})
retriever = db.as_retriever(search_kwargs={'k': 2})
prompt = PromptTemplate(
template=template,
input_variables=['context', 'question'])
return RetrievalQA.from_chain_type(llm=llm,
chain_type='stuff',
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={'prompt': prompt})
defsummarize_article(file: pathlib.Path):
"""
Summarize the content of an old markdown file and
save the result in a new markdown file.
""" output_file = pathlib.Path("summaries").joinpath(file.name)
db = analyze(file)
qa_llm = build_llm(db)
output = qa_llm(
{'query': "What is a one sentence summary of this blog post?"})
summary =str(output["result"]).strip()
withopen(output_file, 'w') as f:
f.write(summary)
print(summary)
if __name__ =="__main__":
files =sorted(list(pathlib.Path("posts").glob("*.md")))
for i, file inenumerate(files):
print(f"summarizing {i+1} of {len(files)}")
summarize_article(file)
Summarizing 355 blog posts took a little over 6 hours, averaging about one minute each. Interestingly, longer articles did not take significantly longer to summarize. The following chart shows the time it took for the Q8 model to summarize each of the 355 old blog posts.
Note that I used ScottPlot.NET to create this chart using C#.
// Read data about each post from the log file
string[] lines = File.ReadAllLines("log.tsv");
List<double> logWords = new();
List<double> times = new();
foreach (string line in lines)
{
    string[] strings = line.Split('\t');
    if (strings.Length != 6)
        continue;
    logWords.Add(Math.Log10(double.Parse(strings[3])));
    times.Add(double.Parse(strings[4]));
}

// Plot article length vs processing time
ScottPlot.Plot plot = new(600, 400);
plot.AddScatterPoints(logWords.ToArray(), times.ToArray());
plot.XAxis.MinorLogScale(true);
static string FormatTick(double value) => Math.Pow(10, value).ToString();
plot.XAxis.TickLabelFormat(FormatTick);
plot.SetAxisLimits(1, 4, 25, 100);
plot.Title("AI Blog Post Summary Performance");
plot.XLabel("Article Length (Number of Words)");
plot.YLabel("Processing Time (Seconds)");
plot.SaveFig("performance.png");
I found it interesting to compare summaries generated by the Q8 model against the smaller Q2_K model. Model descriptions can be found on the Hugging Face readme page.
| Metric | Larger Model | Smaller Model |
|---|---|---|
| Specific Model (Llama 2 7B) | chat.ggmlv3.q8_0 | chat.ggmlv3.q2_K |
| Model Size on Disk | 7.0 GB | 2.8 GB |
| Mean Response Time | 62.4 ± 1.8 sec | 25.4 ± 0.8 sec |
The smaller model ran in less than half the time and typically produced adequate summaries, but the larger model often produced summaries which were slightly better and had a little more personality. For this reason I used the larger model to generate all of the summaries I ended up using.
| Larger Model | Smaller Model |
|---|---|
| The author has lost all their hard work and progress on their websites due to technical issues, and they are unsure if they will try to recreate them or keep this page as a reminder of their past efforts. | The author has lost all of their work on their websites, including thousands of lines of code, and is unlikely to try to recreate them. Instead, they will keep this page as a reminder of the time and effort that went into creating their websites. |
| The author is unsure of what they will do with their website in the future, but they plan to continue adding to it sporadically as a way to keep it updated and active. | The blog post is about the author’s plans for their website, including creating a new page or section for it, and using it as an almost blog style format to share information with readers. |
| The author is trying to create a bootable CD with the original Quake game that supports soundcard and mouse functionality. | The author of the blog post is trying to create a bootable CD with the original Quake game on it that supports soundcards and mice, so that they can play the game at LAN parties without having to worry about setting up the operating system or dealing with any issues related to the game. |
| The writer is experiencing hostility from their parents due to their online writing, specifically their weblog, and is finding it difficult to write normally as a result. | The writer is upset that their parents are reading and misinterpreting their blog posts, leading to incorrect assumptions about their intentions. |
| The writer is worried about something someone said to them and is trying to figure it out by writing it down and hoping to understand it better by morning. | The author is worried about something someone said to them and is trying to figure it out by going to sleep and thinking about it overnight. |
| The author of the blog post is claiming that someone stole pictures from their website and passed them off as their own, and they got revenge by creating fake pictures of the thief’s room and linking to them from their site, causing the thief to use a lot of bandwidth. | The blog post is about someone who found pictures of their room on another person’s website, claimed they were their own, and linked to them from their site without permission. |
About 1 in 50 summaries were totally off base. One of them claimed my blog article described the meaning of a particular music video, and after watching it on YouTube I’m confident that I had never seen the video or heard the song.
About 1 in 100 summaries lacked sufficient depth for me to be satisfied with the result (because the original posts were very long and/or very important to me), so I will be doing a paragraph-by-paragraph summarization and manually editing the result to ensure the spirit of each article is sufficiently captured.
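For those few posts, one option is to reuse the analyze() and build_llm() helpers defined above and query the model once per paragraph. The following is only a rough sketch of that idea (it assumes those functions live in a hypothetical summarize.py module and that paragraphs are separated by blank lines), not the exact workflow I will use:

import pathlib
from summarize import analyze, build_llm  # hypothetical module containing the functions above

def summarize_paragraphs(file: pathlib.Path) -> list:
    """Return a one-sentence summary for each paragraph of a blog post."""
    db = analyze(file)
    qa_llm = build_llm(db)
    paragraphs = [p.strip() for p in file.read_text().split("\n\n") if p.strip()]
    summaries = []
    for paragraph in paragraphs:
        query = "What is a one sentence summary of this paragraph? " + paragraph
        output = qa_llm({'query': query})
        summaries.append(str(output["result"]).strip())
    return summaries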
Overall, this was an impressive amount of work achieved in a single day. The source material contained almost half a million words in total, and I was able to adequately review all of that content!
In the process of reviewing the AI-generated blog post summaries, I found a few things that made me chuckle:
November 25, 2003 (Summary): The author of the blog post is discussing how they were unable to find a good way to summarize their own blog posts, despite trying various methods.
Funny how that one came around full circle!
March 21, 2005 (Summary): I can’t believe I’m making so many audioblogs! Download the latest one on the Audioblogs page (a link is on the right side of the page).
The term “Audioblogs” was used before the word “podcast” was popular. I remember recording at least a few episodes, and I’m really disappointed they weren’t preserved with the rest of my blog posts. I also found references to “Videoblogs” or “Vlogs”, and I remember recording at least two episodes of that, but I think those too have been lost to history.
June 4, 2005 (Summary): The author of the blog uses WordPress as their blogging engine instead of MovableType, which they previously used but grew to hate due to speed issues.
That one came full circle too! My blog started as a static site, went to a database-driven site, then worked its way all the way back to static where it is today. My best guess at the dates of the various technologies is:
1997-2001: Static site (editing HTML files)
2001-2003: Locally hosted dynamic (classic ASP or PHP)
2003-2005: MovableType (static site with a PHP editor)
2005-2020: WordPress (PHP with SQL content storage)
2020-2023: Flat file site (with a custom PHP page generator)
2023-present: Static site (built from Markdown by Hugo using GitHub Actions)
Completing this project allowed me to add 325 previously hidden blog posts back onto my website, bringing the total number of posts from 295 to 620! It took several hours for the AI tools to summarize almost half a million words of old blog posts, then a few hours of my own time spot-checking the summaries and tweaking them where needed, but overall I’m very satisfied by how I was able to wield generative AI to assist the transformation of such a large amount of text in a single day.
This page describes how I use Python to ingest information from documents on my filesystem and run the Llama 2 large language model (LLM) locally to answer questions about their content. My ultimate goal with this work is to evaluate the feasibility of developing an automated system to digest software documentation and serve AI-generated answers to technical questions based on the latest available information.
Quickstart: The previous post Run Llama 2 Locally with Python describes a simpler strategy for running Llama 2 locally if your goal is to generate AI chat responses to text prompts without ingesting content from local documents.
The model I’m using (7B Q8_0) is good but heavy, requiring 7 GB of disk space and 10 GB of RAM. Lighter and faster models are available. I recommend trying different models, and if your application can use a simpler model it may result in significantly improved performance.
Installing these packages requires a C++ compiler. To get one:
Windows: Install Visual Studio Community with the “Desktop development with C++” workload. It is free for individuals and open-source developers. See the C++ installation guide for more information.
I created a few plain text files that contain information for testing. I am careful to be specific enough (and absurd enough) that we can be confident the AI chat is referencing this material when answering questions about it later.
Scott William Harden is an open-source software developer. He is the primary author of ScottPlot, pyabf, FftSharp, Spectrogram, and several other open-source packages. Scott’s favorite color is dark blue despite the fact that he is colorblind. Scott’s advocacy for those with color vision deficiency (CVD) leads him to recommend perceptually uniform color maps like Viridis instead of traditional color maps like Spectrum and Jet.
“JupyterGoBoom” is the name of a Python package for creating unmaintainable Jupyter notebooks. It is no longer actively developed and is now considered obsolete because modern software developers have come to realize that Jupyter notebooks grow to become unmaintainable all by themselves.
We can use the MiniLM language model to interpret the content of our documents and save that information in a vector database so the AI chat has access to it. The interpreted information is saved to disk in the FAISS (Facebook AI Similarity Search) file format, a vector database optimized for similarity search across large, high-dimensional datasets.
"""
This script creates a database of information gathered from local text files.
"""fromlangchain.document_loadersimport DirectoryLoader, TextLoader
fromlangchain.text_splitterimport RecursiveCharacterTextSplitter
fromlangchain.embeddingsimport HuggingFaceEmbeddings
fromlangchain.vectorstoresimport FAISS
# define what documents to loadloader = DirectoryLoader("./", glob="*.txt", loader_cls=TextLoader)
# interpret information in the documentsdocuments = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
chunk_overlap=50)
texts = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2",
model_kwargs={'device': 'cpu'})
# create and save the local databasedb = FAISS.from_documents(texts, embeddings)
db.save_local("faiss")
Although this example saves the vector database to disk so it can be loaded later by a large language model, you do not have to write the vector database to disk if you only intend to consume it in the same Python script. In this case you can omit db.save_local() and consume the db object directly instead of calling FAISS.load_local(), keeping the database in memory the whole time.
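For example, a minimal sketch of that in-memory pattern (using the same loaders and embedding model as above) could look like this:

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# load and split the documents exactly as before
loader = DirectoryLoader("./", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'})

# keep the database in memory rather than calling db.save_local()
db = FAISS.from_documents(texts, embeddings)

# the in-memory db can be handed straight to a retriever later in the same script
retriever = db.as_retriever(search_kwargs={'k': 2})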
Users may desire to customize arguments of the various methods listed here to improve behavior in application-specific ways. For example, RecursiveCharacterTextSplitter() has optional keyword arguments for chunk_size and chunk_overlap which are often customized to best suit the type of content being ingested. See Chunking Strategies for LLM Applications, What Chunk Size and Chunk Overlap Should You Use? and the LangChain documentation for more information.
LangChain has advanced tools available for ingesting information in complex file formats like PDF, Markdown, HTML, and JSON. Plain text files are used in this example to keep things simple, but more information is available in the official documentation.
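For example, here is a rough sketch of how a folder of Markdown files might be ingested instead of plain text, using the UnstructuredMarkdownLoader class (the docs folder name is illustrative, and this loader typically requires the additional unstructured package to be installed):

from langchain.document_loaders import DirectoryLoader, UnstructuredMarkdownLoader

# load every Markdown file in a docs folder instead of plain text files
loader = DirectoryLoader("./docs", glob="*.md",
                         loader_cls=UnstructuredMarkdownLoader)
documents = loader.load()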
We can now prepare an AI chat from an LLM pre-loaded with information contained in our documents and use it to answer questions about their content.
"""
This script reads the database of information from local text files
and uses a large language model to answer questions about their content.
"""fromlangchain.llmsimport CTransformers
fromlangchain.embeddingsimport HuggingFaceEmbeddings
fromlangchain.vectorstoresimport FAISS
fromlangchainimport PromptTemplate
fromlangchain.chainsimport RetrievalQA
# prepare the template we will use when prompting the AItemplate ="""Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}Question: {question}Only return the helpful answer below and nothing else.
Helpful answer:
"""# load the language modelllm = CTransformers(model='./llama-2-7b-chat.ggmlv3.q8_0.bin',
model_type='llama',
config={'max_new_tokens': 256, 'temperature': 0.01})
# load the interpreted information from the local databaseembeddings = HuggingFaceEmbeddings(
model_name="sentence-transformers/all-MiniLM-L6-v2",
model_kwargs={'device': 'cpu'})
db = FAISS.load_local("faiss", embeddings)
# prepare a version of the llm pre-loaded with the local contentretriever = db.as_retriever(search_kwargs={'k': 2})
prompt = PromptTemplate(
template=template,
input_variables=['context', 'question'])
qa_llm = RetrievalQA.from_chain_type(llm=llm,
chain_type='stuff',
retriever=retriever,
return_source_documents=True,
chain_type_kwargs={'prompt': prompt})
# ask the AI chat about information in our local filesprompt ="Who is the author of FftSharp? What is their favorite color?"output = qa_llm({'query': prompt})
print(output["result"])
Here are the answers the script above gave me to the following questions:
Question: Who is the author of FftSharp? What is their favorite color?
Response: Scott William Harden is the author of FftSharp. According to Scott, his favorite color is dark blue despite being colorblind.
Question: Why is JupyterGoBoom obsolete?
Response: JupyterGoBoom is considered obsolete because modern software developers have come to realize that Jupyter notebooks become unmaintainable all by themselves.
Both answers are consistent with the information written in the ingested text documents. The AI is successfully answering questions about our custom content!
Here is a standalone Jupyter notebook that demonstrates how to ingest information from documents and interact with a large language model to have AI chat answer questions about their content. This notebook contains a few extra features to improve formatting of the output as well.
The recent availability of open source large language models that can be run locally offers many interesting opportunities that were not feasible with cloud-only offerings like ChatGPT. For example, consider that your application involves large amounts of sensitive data. You can ingest all the documents and ask AI chat specific questions about their content all on a local machine. This permits development of AI tools that can safely interact with high security operations requiring intellectual property protection, handling of medical data, and other sensitive applications where information security is paramount.
As a maintainer of several open-source projects, I am curious to learn if this strategy may be useful for generating automated responses to questions posted in GitHub issues that have answers which already exist in the software documentation. It would be interesting to incorporate AI content ingestion into the CI/CD pipeline of my software projects so that each software release would be paired with an AI chat bot that provides answers to technical questions with version-specific guidance based upon the documentation generated during that same build workflow. I am also curious about the effectiveness of pairing ingested documentation with GitHub Issues and Discussions so that responses to user questions can point to recommended threads where human conversation may be useful.
There are many avenues to explore from here, and I’m thankful for all of the developers and engineering teams that provide these models and tools for anyone to experiment with.
The day after I wrote this article I was working on an unrelated project that resulted in me Googling myself. I was surprised to see that the silly training text above is now Google’s description of my domain. I hope it will change again when I add a new blog post and that text ends up lower down on the page. This is a good reminder that in the world of automated document evaluation and AI-assisted generation, any text you share online may be consumed by robots and fed to others in ways you may not be able to control.
This page describes how to interact with the Llama 2 large language model (LLM) locally using Python, without requiring internet, registration, or API keys. We will deliver prompts to the model and get AI-generated chat responses using the llama-cpp-python package.
The model I’m using here is the largest and slowest one currently available. It is 7 GB in size and requires 10 GB of ram to run. Developers should experiment with different models, as simpler models may run faster and produce similar results for less complex tasks.
Install the latest version of Python from python.org
Create a virtual environment: python -m venv .venv
Activate the virtual environment: .venv/Scripts/activate
Install the llama-cpp-python package: pip install llama-cpp-python
Installation will fail if a C++ compiler cannot be located. To get one:
Windows: Install Visual Studio Community with the “Desktop development with C++” workload. It is free for individuals and open-source developers. See the C++ installation guide for more information.
Create a new Python script and run it inside the virtual environment:
# load the large language model file
from llama_cpp import Llama
LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q8_0.bin")

# create a text prompt
prompt = "Q: What are the names of the days of the week? A:"

# generate a response (takes several seconds)
output = LLM(prompt)

# display the response
print(output["choices"][0]["text"])
It took my system 7 seconds to generate this response:
Question: What are the names of the days of the week?
Answer: The names of the days of the week, in order, are: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
To get longer responses to complex questions, set n_ctx to a larger number to increase the number of tokens the large language model uses for context, then remove the max_tokens limit for the query. While this allows longer responses, it can significantly increase the total time required to generate a response.
from llama_cpp import Llama

# define n_ctx manually to permit larger contexts
LLM = Llama(model_path="./llama-2-7b-chat.ggmlv3.q8_0.bin", n_ctx=2048)

# create a text prompt
prompt = "Q: Why are Jupyter notebooks difficult to maintain? A:"

# set max_tokens to 0 to remove the response size limit
output = LLM(prompt, max_tokens=0)

# display the response
print(output["choices"][0]["text"])
It took my system 1.5 minutes to generate this response:
Question: Why are Jupyter notebooks difficult to maintain?
Response:
Jupyter notebooks can be challenging to maintain for several reasons:
Temporary file system: Jupyter notebooks reside on the user’s local disk, which can make them unreliable and difficult to maintain over time.
Version control issues: As notebooks are updated frequently, it becomes challenging to keep track of changes and manage different versions of a notebook. This can lead to conflicts when working with others on the same project.
Lack of organization: Since Jupyter notebooks are created manually, it’s easy to lose track of which files are included in the notebook and how they relate to each other. This makes it challenging to maintain a well-organized structure for large projects with many components.
Limited automation: There are limited tools available for automating tasks within Jupyter notebooks, making it more challenging to create reproducible builds or automate the testing process.
Security concerns: Since Jupyter notebooks contain sensitive data such as user credentials, personal information, and intellectual property, they must be protected against unauthorized access and breaches. This can further complicate maintenance efforts.
Interactive interfaces: Jupyter notebooks are designed for interactive use, which means that any changes to the code or visualizations will affect the interface in real-time. This can make it challenging to maintain a consistent appearance while working on a project.
Limited collaboration support: Jupyter notebooks are primarily designed as an individual tool, which limits their ability to accommodate multiple users and their contributions simultaneously. As a result, collaboration becomes more complicated with larger teams.
Lack of integration with other tools: Jupyter notebooks do not integrate well with other popular development tools like Git or IDEs (Integrated Development Environments). This can make it difficult to manage multiple projects, keep track of changes, and collaborate effectively with team members.
Limited scalability: As projects grow in size and complexity, Jupyter notebooks may become less efficient and more challenging to maintain. This is particularly true for large-scale data science or machine learning projects that require a lot of processing power and memory.
Lack of formalized workflows: Since Jupyter notebooks are primarily designed as an interactive environment, they don’t provide standardized workflows or processes for managing projects. This can result in a less organized and more challenging maintenance experience for project teams.
By understanding these challenges, you can better prepare yourself to manage the complexities of Jupyter notebooks and develop strategies to address them effectively.
Here is a standalone Jupyter notebook that demonstrates how to use different large language models to generate AI chat responses to plain text prompts. This notebook contains a few extra features to improve formatting of the output as well.
The AI-generated responses above only contain information built into the model itself. What if we want to give our language model custom information then have it be able to answer questions about it?
This website is a static site containing thousands of pages, and I recently had the desire to check all of them for broken links and images. Although it would be a tedious task to perform manually, I was able to automate the process using Python. This page describes the process I used to identify broken links and images and may be helpful to others using static site generator tools like Jekyll, Hugo, Eleventy, Pelican, Gatsby, other Jamstack site generators, or hand-written HTML files.
Although my website content is stored as Markdown files, I found it most convenient to check the generated HTML files for broken link and image URLs. I generated my static site locally, then manually removed pagination folders so only article HTML files were present. I then wrote the following Python script, which uses pathlib to locate HTML files, uses BeautifulSoup to analyze the href attribute of links and the src attribute of images, and saves all identified URLs as a CSV file.
import pathlib
from bs4 import BeautifulSoup

def get_urls(file: pathlib.Path) -> list:
    """Return link and image URLs in a HTML file"""
    with open(file, errors='ignore') as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    urls = set()

    for link in soup.find_all('a'):
        href = link.get('href')
        if href.startswith("#"):
            continue
        if "#" in href:
            href = href.split("#")[0]
        href = str(href).strip('/')
        urls.add(href)

    for image in soup.find_all('img'):
        src = image.get('src')
        urls.add(str(src))

    return sorted(list(urls))

def get_urls_by_page(folder: pathlib.Path) -> dict:
    """Return link and image URLs for all HTML files in a folder"""
    html_files = list(folder.rglob("*.html"))
    urls_by_page = {}
    for i, html_file in enumerate(html_files):
        urls = get_urls(html_file)
        urls_by_page[html_file] = urls
        print(f"{i+1} of {len(html_files)}: {len(urls)} URLs found in {html_file}")
    return urls_by_page

def write_csv(urls_by_page: dict, csv_file: pathlib.Path):
    txt = 'URL, Page\n'
    for page, urls in urls_by_page.items():
        for url in urls:
            txt += f'{url}, {page}\n'
    csv_file.write_text(txt)

if __name__ == "__main__":
    folder = pathlib.Path("public")
    urls_by_page = get_urls_by_page(folder)
    write_csv(urls_by_page, pathlib.Path("urls.csv"))
Running the script generated a CSV report showing every link on my website, organized by which page it’s on. It looks like my blog has over 7,000 links! That would be a lot to check by hand.
Each URL from the report was checked using HEAD requests. Note that a HEAD request to a file path only returns HTTP headers but does not actually download the file, allowing it to consume far less bandwidth than GET requests. Inspecting the HTTP response code indicates whether the URL is a valid path to a file (code 200).
I used Python sets to prevent checking the same URL twice. I logged good and broken URLs as they were checked, and consumed these log files at startup so the program could be stopped and restarted without re-checking the same URL.
import pathlib
import urllib.request
import time

def is_url_valid(url: str) -> bool:
    """Check if a URL exists without downloading its contents"""
    request = urllib.request.Request(url)
    request.get_method = lambda: 'HEAD'
    try:
        urllib.request.urlopen(request, timeout=5)
        return True
    except Exception as e:
        print(f"ERROR: {e}")
        return False

if __name__ == "__main__":
    # load URLs from report
    urls = [x.split(", ")[0] for x
            in pathlib.Path("urls.csv").read_text().split("\n")
            if x.startswith("http")]

    # load previously checked URLs
    url_file_good = pathlib.Path("urls-good.txt")
    url_file_good.touch()
    url_file_bad = pathlib.Path("urls-bad.txt")
    url_file_bad.touch()
    checked = set()
    for url in url_file_good.read_text().split("\n"):
        checked.add(url)
    for url in url_file_bad.read_text().split("\n"):
        checked.add(url)

    # check each URL
    for i, url in enumerate(urls):
        print(f"{i+1} of {len(urls)}", end=" ")
        if url in checked:
            print(f"SKIPPING {url}")
            continue
        time.sleep(.2)
        print(f"CHECKING {url}")
        log_file = url_file_good if is_url_valid(url) else url_file_bad
        with open(log_file, 'a') as f:
            f.write(url + "\n")
        checked.add(url)
I wrote a Python script to generate an HTML report of all the broken links on my website. The script shows every broken link found on the website and lists all the pages on which each broken link appears.
import pathlib

urls_and_pages = [x.split(", ") for x
                  in pathlib.Path("urls.csv").read_text().split("\n")
                  if x.startswith("http")]

urls_broken = [x for x
               in pathlib.Path("urls-bad.txt").read_text().split("\n")
               if x.startswith("http")]
urls_broken = set(urls_broken)

html = "<h1>Broken Links</h1>"
for url_broken in urls_broken:
    html += f"<div class='mt-4'><a href='{url_broken}'>{url_broken}</a></div>"
    pages = [x[1] for x in urls_and_pages if x[0] == url_broken]
    for page in pages:
        html += f"<div><code>{page}</code></div>"
I wrapped the output in Bootstrap to make the report look pretty and appear properly on mobile devices:
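The wrapping step is just string templating around the report body built above. A minimal sketch of that idea (the Bootstrap CDN link and output filename here are illustrative choices, not necessarily the exact ones I used) looks something like this:

import pathlib

# wrap the report body (the `html` string built above) in a simple Bootstrap page
page = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet"
      href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.0/dist/css/bootstrap.min.css">
</head>
<body>
<div class="container my-5">
{html}
</div>
</body>
</html>"""

pathlib.Path("report.html").write_text(page)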
The generated report lets me focus my effort narrowly to tactically fix the broken URLs I think are most important (e.g., image URLs and URLs pointing to domain names).
Testing URLs extracted from a folder of HTML files proved to be an effective method for identifying broken links and images across a large static website with thousands of pages.
This method could be employed in a CI/CD pipeline to ensure links and image paths are valid on newly created pages.
If the goal is to validate internal links only, HTTP requests could be replaced with path existence checks if the entire website is present in the local filesystem, as sketched below.
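Here is a minimal sketch of that last idea, checking a site-relative URL against the generated public folder rather than making an HTTP request (the folder name and example URL are just illustrative):

import pathlib

def is_internal_url_valid(url: str, site_root: pathlib.Path = pathlib.Path("public")) -> bool:
    """Check a site-relative URL by looking for a matching file in the generated site."""
    target = site_root / url.strip("/")
    # a URL may point at a file directly or at a folder served via its index page
    return target.is_file() or (target / "index.html").is_file()

print(is_internal_url_valid("blog/example-post/"))  # hypothetical URL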
Treemap diagrams display a series of positive numbers using rectangles sized proportional to the value of each number. This page demonstrates how to calculate the size and location of rectangles to create a tree map diagram using C#. Although the following uses System.Drawing to save the tree map as a Bitmap image, these concepts may be combined with information on the C# Data Visualization page to create treemap diagrams using SkiaSharp, WPF, or other graphics technologies.
The tree map above was generated from random data using the following C# code:
// Create sample data. Data must be sorted large to small.
double[] sortedValues = Enumerable.Range(0, 40)
    .Select(x => (double)Random.Shared.Next(10, 100))
    .OrderByDescending(x => x)
    .ToArray();

// Create an array of labels in the same order as the sorted data.
string[] labels = sortedValues.Select(x => x.ToString()).ToArray();

// Calculate the size and position of all rectangles in the tree map
int width = 600;
int height = 400;
RectangleF[] rectangles = TreeMap.GetRectangles(sortedValues, width, height);

// Create an image to draw on (with 1px extra to make room for the outline)
using Bitmap bmp = new(width + 1, height + 1);
using Graphics gfx = Graphics.FromImage(bmp);
using Font fnt = new("Consolas", 8);
using SolidBrush brush = new(Color.Black);
gfx.Clear(Color.White);

// Draw and label each rectangle
for (int i = 0; i < rectangles.Length; i++)
{
    brush.Color = Color.FromArgb(
        red: Random.Shared.Next(150, 250),
        green: Random.Shared.Next(150, 250),
        blue: Random.Shared.Next(150, 250));
    gfx.FillRectangle(brush, rectangles[i]);
    gfx.DrawRectangle(Pens.Black, rectangles[i]);
    gfx.DrawString(labels[i], fnt, Brushes.Black, rectangles[i].X, rectangles[i].Y);
}

// Save the output
bmp.Save("treemap.bmp");
The previous code block focuses on data generation and display, but hides the tree map calculations behind the TreeMap class. Below is the code for that class. It is a self-contained static class which exposes a single static method that takes a pre-sorted array of values and returns tree map rectangles ready to display on an image.
💡 Although System.Drawing.Common is a Windows-only library (as of .NET 7), System.Drawing.Primitives is a cross-platform package that provides the RectangleF structure used in the tree map class. See the SkiaSharp Quickstart to learn how to create image files using cross-platform .NET code.
public static class TreeMap
{
    public static RectangleF[] GetRectangles(double[] values, int width, int height)
    {
        for (int i = 1; i < values.Length; i++)
            if (values[i] > values[i - 1])
                throw new ArgumentException("values must be ordered large to small");

        var slice = GetSlice(values, 1, 0.35);
        var rectangles = GetRectangles(slice, width, height);
        return rectangles.Select(x => x.ToRectF()).ToArray();
    }

    private class Slice
    {
        public double Size { get; }
        public IEnumerable<double> Values { get; }
        public Slice[] Children { get; }

        public Slice(double size, IEnumerable<double> values, Slice sub1, Slice sub2)
        {
            Size = size;
            Values = values;
            Children = new Slice[] { sub1, sub2 };
        }

        public Slice(double size, double finalValue)
        {
            Size = size;
            Values = new double[] { finalValue };
            Children = Array.Empty<Slice>();
        }
    }

    private class SliceResult
    {
        public double ElementsSize { get; }
        public IEnumerable<double> Elements { get; }
        public IEnumerable<double> RemainingElements { get; }

        public SliceResult(double elementsSize, IEnumerable<double> elements, IEnumerable<double> remainingElements)
        {
            ElementsSize = elementsSize;
            Elements = elements;
            RemainingElements = remainingElements;
        }
    }

    private class SliceRectangle
    {
        public Slice Slice { get; set; }
        public float X { get; set; }
        public float Y { get; set; }
        public float Width { get; set; }
        public float Height { get; set; }
        public SliceRectangle(Slice slice) => Slice = slice;
        public RectangleF ToRectF() => new(X, Y, Width, Height);
    }

    private static Slice GetSlice(IEnumerable<double> elements, double totalSize, double sliceWidth)
    {
        if (elements.Count() == 1)
            return new Slice(totalSize, elements.Single());

        SliceResult sr = GetElementsForSlice(elements, sliceWidth);
        Slice child1 = GetSlice(sr.Elements, sr.ElementsSize, sliceWidth);
        Slice child2 = GetSlice(sr.RemainingElements, 1 - sr.ElementsSize, sliceWidth);
        return new Slice(totalSize, elements, child1, child2);
    }

    private static SliceResult GetElementsForSlice(IEnumerable<double> elements, double sliceWidth)
    {
        var elementsInSlice = new List<double>();
        var remainingElements = new List<double>();
        double current = 0;
        double total = elements.Sum();

        foreach (var element in elements)
        {
            if (current > sliceWidth)
            {
                remainingElements.Add(element);
            }
            else
            {
                elementsInSlice.Add(element);
                current += element / total;
            }
        }

        return new SliceResult(current, elementsInSlice, remainingElements);
    }

    private static IEnumerable<SliceRectangle> GetRectangles(Slice slice, int width, int height)
    {
        SliceRectangle area = new(slice) { Width = width, Height = height };

        foreach (var rect in GetRectangles(area))
        {
            if (rect.X + rect.Width > area.Width)
                rect.Width = area.Width - rect.X;
            if (rect.Y + rect.Height > area.Height)
                rect.Height = area.Height - rect.Y;
            yield return rect;
        }
    }

    private static IEnumerable<SliceRectangle> GetRectangles(SliceRectangle sliceRectangle)
    {
        var isHorizontalSplit = sliceRectangle.Width >= sliceRectangle.Height;
        var currentPos = 0;

        foreach (var subSlice in sliceRectangle.Slice.Children)
        {
            var subRect = new SliceRectangle(subSlice);
            int rectSize;

            if (isHorizontalSplit)
            {
                rectSize = (int)Math.Round(sliceRectangle.Width * subSlice.Size);
                subRect.X = sliceRectangle.X + currentPos;
                subRect.Y = sliceRectangle.Y;
                subRect.Width = rectSize;
                subRect.Height = sliceRectangle.Height;
            }
            else
            {
                rectSize = (int)Math.Round(sliceRectangle.Height * subSlice.Size);
                subRect.X = sliceRectangle.X;
                subRect.Y = sliceRectangle.Y + currentPos;
                subRect.Width = sliceRectangle.Width;
                subRect.Height = rectSize;
            }

            currentPos += rectSize;

            if (subSlice.Values.Count() > 1)
            {
                foreach (var sr in GetRectangles(subRect))
                {
                    yield return sr;
                }
            }
            else if (subSlice.Values.Count() == 1)
            {
                yield return subRect;
            }
        }
    }
}
A few days ago I wrote an article describing how to programmatically generate .NET source code analytics using C#. Using these tools I analyzed the source code for all classes in a large project (ScottPlot.NET). The following tree map displays every class in the project as a rectangle sized according to number of lines of code and colored according to maintainability.
In this diagram large rectangles represent classes with the most code, and red color indicates classes that are difficult to maintain.
💡 I’m using a perceptually uniform colormap (similar to Turbo) provided by the ScottPlot NuGet package. See ScottPlot’s colormaps gallery for all available colormaps.
💡 The Maintainability Index is a value between 0 (worst) and 100 (best) that represents the relative ease of maintaining the code. It’s calculated from a combination of Halstead complexity (size of the compiled code), Cyclomatic complexity (number of paths that can be taken through the code), and the total number of lines of code.
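The exact definition varies between tools, but the commonly cited formulation used by Visual Studio (rescaled so results fall between 0 and 100) is approximately:

$$\text{MI} = \max\left(0,\ \frac{100}{171}\left(171 - 5.2\ln V - 0.23\,G - 16.2\ln L\right)\right)$$

where $V$ is the Halstead volume, $G$ is the cyclomatic complexity, and $L$ is the number of lines of code.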