The personal website of Scott W Harden

Using Llama 2 to Answer Questions About Local Documents

How to use open source large language models to ingest information from local files and generate AI responses to questions about their content

This page describes how I use Python to ingest information from documents on my filesystem and run the Llama 2 large language model (LLM) locally to answer questions about their content. My ultimate goal with this work is to evaluate feasibility of developing an automated system to digest software documentation and serve AI-generated answers to technical questions based on the latest available information.

Quickstart: The previous post Run Llama 2 Locally with Python describes a simpler strategy to running Llama 2 locally if your goal is to generate AI chat responses to text prompts without ingesting content from local documents.

Environment Setup

Download a Llama 2 model in GGML Format

The model I’m using (7B Q8 0) is good but heavy, requiring 7 GB disk space and 10 GB ram. Lighter and faster models are available. I recommend trying different models, and if your application can use a simpler model it may result in significantly improved performance.

Create a Virtual Environment

Install Required Packages

Altogether we will need the langchain, sentence-transformers, faiss-cpu, and ctransformers packages:

pip install langchain sentence_transformers faiss-cpu ctransformers

Installing these packages requires a C++ compiler. To get one:

Prepare Documents for Ingestion

I created a few plain text files that contain information for testing. I am careful to be specific enough (and absurd enough) that we can be confident when the AI chat is referencing this material when answering responses about it later.


Scott William Harden is an open-source software developer. He is the primary author of ScottPlot, pyabf, FftSharp, Spectrogram, and several other open-source packages. Scott’s favorite color is dark blue despite the fact that he is colorblind. Scott’s advocacy for those with color vision deficiency (CVD) leads him to recommend perceptually uniform color maps like Viridis instead of traditional color maps like Spectrum and Jet.


“JupyterGoBoom” is the name of a Python package for creating unmaintainable Jupyter notebooks. It is no longer actively developed and is now considered obsolete because modern software developers have come to realize that Jupyter notebooks grow to become unmaintainable all by themselves.

Interpreting Content of Local Files

We can use the MiniLM language model to interpret the content our documents and save that information in a vector database so AI chat has access to it. The interpreted information is saved to disk in the FAISS (Facebook AI Similarity Search) file format, a vector database optimized for searching for similarly across large and high dimensional datasets.

This script creates a database of information gathered from local text files.

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# define what documents to load
loader = DirectoryLoader("./", glob="*.txt", loader_cls=TextLoader)

# interpret information in the documents
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500,
texts = splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(
    model_kwargs={'device': 'cpu'})

# create and save the local database
db = FAISS.from_documents(texts, embeddings)

Although this example saves the vector database to disk so it can be loaded later by a large language model, you do not have to write the vector database to disk if you only intend to consume it in the same Pythons script. In this case you can omit the db.save_local() and consume the db object later instead of calling FAISS.load_local(), keeping the database in memory the whole time.

Users may desire to customize arguments of the various methods listed here to improve behavior in application-specific ways. For example, RecursiveCharacterTextSplitter() has optional keyword arguments for chunk_size and chunk_overlap which are often customized to best suit the type of content being ingested. See Chunking Strategies for LLM Applications, What Chunk Size and Chunk Overlap Should You Use? and the LangChain documentation for more information.

LangChain has advanced tools available for ingesting information in complex file formats like PDF, Markdown, HTML, and JSON. Plain text files are used in this example to keep things simple, but more information is available in the official documentation.

Prepare an AI That is Aware of Local File Content

We can now prepare an AI Chat from a LLM pre-loaded with information contained in our documents and use it to answer questions about their content.

This script reads the database of information from local text files
and uses a large language model to answer questions about their content.

from langchain.llms import CTransformers
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain import PromptTemplate
from langchain.chains import RetrievalQA

# prepare the template we will use when prompting the AI
template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
Question: {question}
Only return the helpful answer below and nothing else.
Helpful answer:

# load the language model
llm = CTransformers(model='./llama-2-7b-chat.ggmlv3.q8_0.bin',
                    config={'max_new_tokens': 256, 'temperature': 0.01})

# load the interpreted information from the local database
embeddings = HuggingFaceEmbeddings(
    model_kwargs={'device': 'cpu'})
db = FAISS.load_local("faiss", embeddings)

# prepare a version of the llm pre-loaded with the local content
retriever = db.as_retriever(search_kwargs={'k': 2})
prompt = PromptTemplate(
    input_variables=['context', 'question'])
qa_llm = RetrievalQA.from_chain_type(llm=llm,
                                     chain_type_kwargs={'prompt': prompt})

# ask the AI chat about information in our local files
prompt = "Who is the author of FftSharp? What is their favorite color?"
output = qa_llm({'query': prompt})


Here are the answers the script above gave me the the following questions:

Question: Who is the author of FftSharp? What is their favorite color?

Response: Scott William Harden is the author of FftSharp. According to Scott, his favorite color is dark blue despite being colorblind.

Question: Why is JupyterGoBoom obsolete?

Response: JupyterGoBoom is considered obsolete because modern software developers have come to realize that Jupyter notebooks become unmaintainable all by themselves.

Both answers are consistent with the information written in the ingested text documents. The AI is successfully answering questions about our custom content!

Jupyter Notebook

Here is a standalone Jupyter notebook that demonstrates how to ingest information from documents and interact with a large language model to have AI chat answer questions about their content. This notebook contains a few extra features to improve formatting of the output as well.

Ideas and Future Directions

The recent availability of open source large language models that can be run locally offers many interesting opportunities that were not feasible with cloud-only offerings like ChatGPT. For example, consider that your application involves large amounts of sensitive data. You can ingest all the documents and ask AI chat specific questions about their content all on a local machine. This permits development of AI tools that can safely interact with high security operations requiring intellectual property protection, handling of medical data, and other sensitive applications where information security is paramount.

As a maintainer of several open-source projects, I am curious to learn if this strategy may be useful for generating automated responses to questions posted in GitHub issues that have answers which already exist in the software documentation. It would be interesting to incorporate AI content ingestion into the CI/CD pipeline of my software projects so that each software release would be paired with an AI chat bot that provides answers to technical questions with version-specific guidance based upon the documentation generated during that same build workflow. I am also curious about the effectiveness of pairing ingested documentation with GitHub Issues and Discussions so that responses to user questions can point to recommended threads where human conversation may be useful.

There are many avenues to explore from here, and I’m thankful for all of the developers and engineering teams that provide these models and tools for anyone to experiment with.

Update: Google Took My Fake Training Data Seriously

The day after I wrote this article I was working on an unrelated project that resulted in me Googling myself. I was surprised to see that the silly training text above is now Google’s description of my domain. I hope it will change again when I add a new blog post and that text ends up lower down on the page. This is a good reminder that in the world of automated document evaluation and AI-assisted generation, any text you share online may be consumed by robots and fed to others in ways you may not be able to control.