Conversational AI with PDFs

abhinav singhal
12 min read · Oct 19, 2023

Crafting a Conversational AI with PDFs: Coding Adventure

Introduction

Fellow code explorers! 🚀 Today, we traverse through the intriguing terrains of Python, Streamlit, and Generative AI to craft an application that enables us to converse with the content embedded in PDFs! Buckle up, as we dive into this coding journey, step-by-step, ensuring a fun, educational, and engaging adventure.

Prerequisites:

For this setup, I am assuming you already have Python set up in VS Code and have created a virtual environment.

We will need to install the dependencies below for our Python code. Use the following command to install most of them (you can also install them one by one).

pip install streamlit pypdf2 langchain python-dotenv faiss-cpu openai huggingface_hub tiktoken

Basic Setup: Anchoring our Ship with `main()` 🚢

Before we dive into the depths of our code, let’s understand our treasure map, the `main()` function. It’s where our adventure begins and ends, orchestrating the flow of our application. I am calling the application chatpdf.py and creating a .env file with values for OPENAI_API_KEY and HUGGINGFACE_ACCESS_TOKEN:

OPENAI_API_KEY=
HUGGINGFACE_ACCESS_TOKEN=

Get the OpenAI key

Generate an API key at https://platform.openai.com/account/api-keys.

Hugging Face access token

Generate an access token from your Hugging Face account settings.

Save these values in the .env file created in the step above. They will be required to make API calls to OpenAI and Hugging Face.
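If you want to double-check that the keys are being picked up, here is a quick optional sanity check (my addition, not required for the app):

# Optional check: confirm the .env values are visible to Python.
import os
from dotenv import load_dotenv

load_dotenv()
print("OpenAI key loaded:", bool(os.getenv("OPENAI_API_KEY")))
print("HF token loaded:", bool(os.getenv("HUGGINGFACE_ACCESS_TOKEN")))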

from dotenv import load_dotenv

def main():
    # load environment variables
    load_dotenv()


if __name__ == '__main__':
    main()

Run this basic code as shown below; make sure you are running it inside your Python virtual environment.
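For example, from the activated virtual environment:

python chatpdf.py

Nothing visible will happen yet; the script simply loads the environment variables and exits.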

Setting Up Our Interactive Webpage with Streamlit 🏝

Our first destination brings us to the lush lands of Streamlit. Let’s pitch our UI tent and invite our users to interact! Use the code snippet below for guidance.

import streamlit as st
from dotenv import load_dotenv

def main():
    # load environment variables
    load_dotenv()

    # Step 1: configure the page
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")

    st.header("Chat with multiple PDFs :books:")


if __name__ == '__main__':
    main()

Run the application as shown below:

streamlit run chatpdf.py

This should bring your basic application up and running.

🗝 Key Takeaway: Begin by establishing a friendly and inviting user interface.

PDF upload and The Text Extraction 📜

Our adventure takes a deeper dive as we explore the oceans of text extraction from the uploaded PDFs. Let's define a UI control to upload PDF documents and extract text from them. Add the code below.

import streamlit as st
from dotenv import load_dotenv

def main():
    # load environment variables
    load_dotenv()

    # Step 1: configure the page
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")

    st.header("Chat with multiple PDFs :books:")

    # enable file upload in the sidebar
    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader("Upload your PDFs here and click on 'Process'", accept_multiple_files=True)


if __name__ == '__main__':
    main()

Now run the application using the streamlit run command, and you will see that it allows uploading multiple files.

As a next step, we would like to “Process” the PDFs. This involves extracting text from the PDFs and creating chunks out of that text so we can process it locally. Once the text chunks are created, we will create vector embeddings and store them. Let's do it then.

Processing PDF

Add the code below right after the document upload, as shown in the snippet. Make sure the if block is indented at the same level as st.subheader; otherwise the button will show up in the wrong place.

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader("Upload your PDFs here and click on 'Process'", accept_multiple_files=True)

        # handle button click and processing
        if st.button("Process"):
            with st.spinner("Processing"):
                # get the pdf text
                raw_text = getpdftext(pdf_docs)

raw_text will store all the text coming from the getpdftext() function. Now define this function as shown below. Import PdfReader from PyPDF2 so that the text-extraction functionality can be used.

from PyPDF2 import PdfReader

# get text from the uploaded PDFs
def getpdftext(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        # read each page
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

The code above is fairly self-explanatory: we loop over every uploaded PDF and append the text of each page. One caveat is worth mentioning, though.
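extract_text() can return an empty string for scanned or image-only pages, since there is no text layer to read. A slightly defensive sketch of the same function (optional):

# Defensive variant of getpdftext: skip pages with no extractable text.
def getpdftext(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            page_text = page.extract_text()
            if page_text:  # skip pages that yielded nothing
                text += page_text
    return text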

The UI experience should now show the Process button in the sidebar, under the uploader.

You can also check the raw text output via the code below. This will show all the raw text extracted from the uploaded PDFs.

st.write(raw_text)

Chunking Raw text

Now, let's create text chunks out of the raw text created above. Add the code below.

        if st.button("Process"):
            with st.spinner("Processing"):
                # get the pdf text
                raw_text = getpdftext(pdf_docs)

                # get text chunks
                text_chunks = getTextChunks(raw_text)

Define the function getTextChunks(). We will also have to import CharacterTextSplitter from langchain.text_splitter, so let's import it first.

from langchain.text_splitter import CharacterTextSplitter

# get text chunks from the raw text using LangChain
def getTextChunks(raw_text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(raw_text)
    return chunks

In the code above, we take raw_text as input and create chunks of 1000 characters with an overlap of 200 characters. The overlap preserves the continuity of sentences and their context, so meaning is not lost at chunk boundaries. You can tune these values for your own use case. A quick way to see the overlap in action is sketched below.
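Here is a tiny standalone experiment (the sample text is made up purely for illustration):

# Quick demo of chunk_size/chunk_overlap: the tail of one chunk
# reappears at the head of the next, preserving context.
from langchain.text_splitter import CharacterTextSplitter

sample = "\n".join(f"This is sentence number {i}." for i in range(100))
splitter = CharacterTextSplitter(separator="\n", chunk_size=1000,
                                 chunk_overlap=200, length_function=len)
chunks = splitter.split_text(sample)
print(len(chunks), "chunks")
print("end of chunk 0:  ", chunks[0][-40:])
print("start of chunk 1:", chunks[1][:40])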

Discovering the Vector Store Island 🏞

Our journey continues towards embedding our treasures and storing them in the mystical vector store.

Let's add a call to getVectorStore() as shown below; the code block will look like this:

        # handle button click and processing
        if st.button("Process"):
            with st.spinner("Processing"):
                # get the pdf text
                raw_text = getpdftext(pdf_docs)

                # get text chunks
                text_chunks = getTextChunks(raw_text)

                # create vector embeddings and store them
                vectorstore = getVectorStore(text_chunks)

Let's define getVectorStore with the help of new imports:

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS

You can see that we are importing HuggingFaceInstructEmbeddings and using FAISS as the vector store. This is the area where you can experiment by using more robust and efficient ways of creating embeddings and vector stores.

Now write the code below to implement vector embeddings. Here I am using a free embedding model instead of OpenAI's. You can also use OpenAI, but that will cost you more $$.

You can decide by reading the links below:

embedding price: https://openai.com/pricing

free one : https://instructor-embedding.github.io/

pip install InstructorEmbedding sentence_transformers

def getVectorStore(text_chunks):
    # embeddings = OpenAIEmbeddings()
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vectorStore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorStore

So far we have completed the flow where you upload the PDFs and process them: we extract text from the PDFs, create chunks out of it, and finally create embeddings and store them.
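Before moving on, you can sanity-check the store with a direct similarity search, and optionally persist the index to disk so you don't have to re-embed on every run. A small sketch (the query string and index path are just examples):

# Query the FAISS store directly, bypassing the LLM, to verify retrieval.
docs = vectorstore.similarity_search("What is this document about?", k=2)
for doc in docs:
    print(doc.page_content[:120])

# Optionally persist the index; it can be reloaded later with
# FAISS.load_local("faiss_index", embeddings).
vectorstore.save_local("faiss_index")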

🔍 Key Takeaway:

Embeddings transition our textual treasures into a format that our AI can navigate.

Crafting the Conversational Spell 🗣

With treasures in hand, let’s conjure a spell to transform our application into a conversational entity!

Now it's time to make it a conversational chat with the PDFs. Since it's a conversation, it's a good idea to persist its state locally. So let's begin our conversation journey using the embeddings generated above.

This will be managed under st.session_state.conversation.

Introduce this code, which starts the conversation:

st.session_state.conversation = getConversationChain(vectorstore)

Your code block should look like below

        # handle button click and processing
        if st.button("Process"):
            with st.spinner("Processing"):
                # get the pdf text
                raw_text = getpdftext(pdf_docs)

                # get text chunks
                text_chunks = getTextChunks(raw_text)

                # create vector embeddings and store them
                vectorstore = getVectorStore(text_chunks)

                # create the conversation chain and keep it in session state
                st.session_state.conversation = getConversationChain(vectorstore)

Now let's define the getConversationChain function as shown below. We will need to import some libraries:

from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

def getConversationChain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain

Here we are using the power of LLMs via ChatOpenAI() and storing the conversation using ConversationBufferMemory.

The function getConversationChain is designed to set up a conversational interaction chain by combining a chatbot model (ChatOpenAI), a retriever mechanism (vectorstore), and a memory buffer (ConversationBufferMemory). The goal is to have more coherent and context-aware conversations, where the system remembers previous interactions and can fetch relevant information on demand. Below are some detailed highlights of the function, followed by a minimal usage sketch.

Initializing a Conversation Memory:

  • We’re initializing a memory buffer for the conversation using the ConversationBufferMemory class.
  • memory_key='chat_history': It's likely that this memory buffer uses some kind of key-value storage system, and we're specifying that the key for this particular buffer is 'chat_history'.
  • return_messages=True: This argument suggests that when we retrieve information from this memory, it will return the messages or conversational history.

Creating the Conversational Retrieval Chain:

  • Here, we’re using the ConversationalRetrievalChain class to create a 'chain' or a system that combines our chat model (llm) with a retriever and a memory buffer.
  • .from_llm(): This is a class method that helps in creating an instance of the ConversationalRetrievalChain from a given language model (llm).
  • llm=llm: We're passing the ChatOpenAI instance we created earlier.
  • retriever=vectorstore.as_retriever(): This is taking our vectorstore argument and converting it into a 'retriever' using the as_retriever() method. A retriever, in this context, is likely a tool or system that fetches relevant information from a database or knowledge base.
  • memory=memory: We're passing the conversation memory buffer we created earlier.
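To make this concrete, here is a minimal usage sketch of the chain outside Streamlit (assuming a vectorstore already exists and OPENAI_API_KEY is set; the questions are just examples):

# Two turns through the chain: the second question relies on the memory.
chain = getConversationChain(vectorstore)

first = chain({'question': 'What is this PDF about?'})
print(first['answer'])

# The follow-up can say "that" because the memory supplies the history.
second = chain({'question': 'Summarize that in one sentence.'})
print(second['answer'])
print(len(second['chat_history']))  # 4 messages: two human, two AI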

✨ Key Takeaway:

The conversational chain binds our adventure, linking our users to the knowledge within the PDFs.

Engaging with our User Explorers 💬

With every piece of the puzzle in place, we return to our main island to engage with our explorers.

Before doing that, we will have to enable our user interface to accept questions through a text input. The main code will look like this:

def main():
    # load environment variables
    load_dotenv()

    # Step 1: configure the page
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")

    # keep the conversation state across reruns
    if "conversation" not in st.session_state:
        st.session_state.conversation = None

    st.header("Chat with multiple PDFs :books:")

    # handle chat functionality
    user_question = st.text_input("Ask a question about your document")
    if user_question:
        handleUserInput(user_question)

Let's define handleUserInput().

Please note that st.session_state.conversation manages the entire conversation state.


def handleUserInput(user_question):
    response = st.session_state.conversation({'question': user_question})
    st.write(response)
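Why session_state? Streamlit re-runs the entire script on every interaction, so plain Python variables reset each time. A toy example (unrelated to our app) makes this visible:

# Without session_state, the counter would reset to 0 on every click,
# because Streamlit re-executes the script top to bottom each time.
import streamlit as st

if "counter" not in st.session_state:
    st.session_state.counter = 0  # initialized once, survives re-runs

if st.button("Increment"):
    st.session_state.counter += 1

st.write("Clicks so far:", st.session_state.counter)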

Let's make it pretty💎

Now we will add some HTML and CSS elements to make this more user-friendly. Add the code below in main():

from htmlTemplates import css

# ...

st.write(css, unsafe_allow_html=True)

It should look like this

def main():
    # load environment variables
    load_dotenv()

    # Step 1: configure the page
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

Create an htmlTemplates.py file; this is where we are going to save the HTML templates. It should look like this:

css = '''
<style>
.chat-message {
padding: 1.5rem; border-radius: 0.5rem; margin-bottom: 1rem; display: flex
}
.chat-message.user {
background-color: #2b313e
}
.chat-message.bot {
background-color: #475063
}
.chat-message .avatar {
width: 20%;
}
.chat-message .avatar img {
max-width: 78px;
max-height: 78px;
border-radius: 50%;
object-fit: cover;
}
.chat-message .message {
width: 80%;
padding: 0 1.5rem;
color: #fff;
}
'''

bot_template = '''
<div class="chat-message bot">
<div class="avatar">
<img src="https://i.ibb.co/cN0nmSj/Screenshot-2023-05-28-at-02-37-21.png">
</div>
<div class="message">{{MSG}}</div>
</div>
'''

user_template = '''
<div class="chat-message user">
<div class="avatar">
<img src="https://i.ibb.co/rdZC7LZ/Photo-logo-1.png">
</div>
<div class="message">{{MSG}}</div>
</div>
'''

Now we can import these templates into our chat application.

from htmlTemplates import css, bot_template, user_template

In the section where we handle user input, right after it you can test the templates like below:

    if user_question:
        handleUserInput(user_question)

    st.write(user_template.replace("{{MSG}}", "Hello robot"), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello Human"), unsafe_allow_html=True)

These lines take the predefined templates (user_template and bot_template), replace the placeholders in them with specific messages, and then display the modified templates on the Streamlit page with HTML rendering enabled.

Run the app again and the two hard-coded chat bubbles will appear with the new styling.

Now we will have to replace the {{MSG}} placeholder with outputs from the chat API response instead of the hard-coded strings like user_template.replace("{{MSG}}", "Hello robot").

In order to do that we will use st.session_state.chat_history. You can easily see the chat_history field in the JSON response that st.write(response) printed earlier:

{
  "question": "summarize this pdf",
  "chat_history": [
    "HumanMessage(content='summarize this pdf')",
    "AIMessage(content='The PDF discusses the use of vector embeddings and the Language Model for information retrieval and content generation. It explains how time series collections can efficiently ingest and store clickstreams for analytics purposes. It also highlights the benefits of using AI models for rich media analysis and generation. The PDF emphasizes the use of the Language Model for tasks such as natural language processing, computer vision, and content generation. It explains the workflow of combining custom data with the Language Model to generate reliable outputs. The PDF concludes by discussing the transformational impact of vector search and Language Models on unstructured data.')"
  ],
  "answer": "The PDF discusses the use of vector embeddings and the Language Model for information retrieval and content generation. It explains how time series collections can efficiently ingest and store clickstreams for analytics purposes. It also highlights the benefits of using AI models for rich media analysis and generation. The PDF emphasizes the use of the Language Model for tasks such as natural language processing, computer vision, and content generation. It explains the workflow of combining custom data with the Language Model to generate reliable outputs. The PDF concludes by discussing the transformational impact of vector search and Language Models on unstructured data."
}

Just like conversation, initialize chat_history in session state:

if "chat_history" not in st.session_state:
    st.session_state.chat_history = None

The code will look like this:

def main():
    # load environment variables
    load_dotenv()

    # Step 1: configure the page
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

Now let’s enhance the handleUserInput function. Add this code to it

    st.session_state.chat_history = response['chat_history']

    for index, message in enumerate(st.session_state.chat_history):
        if index % 2 == 0:
            st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)

It should look like this:

def handleUserInput(user_question):
    response = st.session_state.conversation({'question': user_question})
    # st.write(response)
    st.session_state.chat_history = response['chat_history']

    for index, message in enumerate(st.session_state.chat_history):
        if index % 2 == 0:
            st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
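The even/odd indexing works because the buffer memory alternates human and AI turns. If you would rather not rely on that ordering, a variant (my sketch) can check the message type instead:

# Alternative rendering loop: dispatch on message type rather than index.
from langchain.schema import HumanMessage

for message in st.session_state.chat_history:
    template = user_template if isinstance(message, HumanMessage) else bot_template
    st.write(template.replace("{{MSG}}", message.content), unsafe_allow_html=True)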

Let's test it. The first question was “summarize this pdf”, and the second question was “now explain what is on the last page”; both come back rendered in the styled chat bubbles.

The entire code will look like this:

import streamlit as st
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain
from htmlTemplates import css, user_template, bot_template

# get text from the uploaded PDFs
def getpdftext(pdf_docs):
    text = ""
    for pdf in pdf_docs:
        pdf_reader = PdfReader(pdf)
        # read each page
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

# get text chunks from the raw text using LangChain
def getTextChunks(raw_text):
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len
    )
    chunks = text_splitter.split_text(raw_text)
    return chunks


def getVectorStore(text_chunks):
    # embeddings = OpenAIEmbeddings()
    embeddings = HuggingFaceInstructEmbeddings(model_name="hkunlp/instructor-xl")
    vectorStore = FAISS.from_texts(texts=text_chunks, embedding=embeddings)
    return vectorStore


def getConversationChain(vectorstore):
    llm = ChatOpenAI()
    memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)
    conversation_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        memory=memory
    )
    return conversation_chain


def handleUserInput(user_question):
    response = st.session_state.conversation({'question': user_question})
    # st.write(response)  # uncomment to inspect the raw response
    st.session_state.chat_history = response['chat_history']

    for i, message in enumerate(st.session_state.chat_history):
        if i % 2 == 0:
            st.write(user_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)
        else:
            st.write(bot_template.replace("{{MSG}}", message.content), unsafe_allow_html=True)


def main():
    # load environment variables
    load_dotenv()

    # Step 1: configure the page
    st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
    st.write(css, unsafe_allow_html=True)

    if "conversation" not in st.session_state:
        st.session_state.conversation = None
    if "chat_history" not in st.session_state:
        st.session_state.chat_history = None

    st.header("Chat with multiple PDFs :books:")

    # handle chat functionality
    user_question = st.text_input("Ask a question about your document")
    if user_question:
        handleUserInput(user_question)

    st.write(user_template.replace("{{MSG}}", "Hello robot"), unsafe_allow_html=True)
    st.write(bot_template.replace("{{MSG}}", "Hello Human"), unsafe_allow_html=True)

    with st.sidebar:
        st.subheader("Your documents")
        pdf_docs = st.file_uploader("Upload your PDFs here and click on 'Process'", accept_multiple_files=True)

        # handle button click and processing
        if st.button("Process"):
            with st.spinner("Processing"):
                # get the pdf text
                raw_text = getpdftext(pdf_docs)

                # get text chunks
                text_chunks = getTextChunks(raw_text)

                # create vector embeddings and store them
                vectorstore = getVectorStore(text_chunks)

                st.session_state.conversation = getConversationChain(vectorstore)


if __name__ == '__main__':
    main()

Conclusion: Unveiling the Treasure — A Conversational AI 🎁

Bravo, adventurers! 🎉 Together, we’ve traversed through terrains of code, creating an enchanting application that converses with us using the hidden knowledge within PDFs. Your treasure — a conversational AI — awaits further exploration and enhancement.

So, fearless coder, sail forth, explore further, and forge ahead, adding more enchantments to your application. The seas of coding adventure are boundless and filled with untold mysteries, awaiting your discovery! 🚀🌊

This article structure endeavors to ensure an enjoyable, enlightening, and step-by-step coding journey for adventurers (readers) of all skill levels. Adjustments for further simplification or enhancement are always a possibility, and your feedback sails this ship to better shores! 🏖🚢
