Introduction
Before we start with this post: go here if you want to know why I was inactive for the last few months. I am back to posting regularly now :)
This is an interesting paper I read recently that proposes a novel way to answer user questions about a document. The key insight is that articles and research papers have a highly structured format; in some cases they follow strict templates. When that is true, it makes little sense to retrieve context from plain blobs of text, which is what current approaches do. PDFTriage instead provides the model not only with the content but also with the metadata that describes the document's structure. This is helpful because the model can now follow the mental model the author had in mind when writing the document.
This enables PDFTriage to accurately answer queries such as “Summarize pages 3-7”, “analyze data in table 3”, etc. In this post, let’s take a look at how it works.
Why do we need pre-processing?
It's a simple point but an important one: we need a pre-processing stage because, for long documents, it's infeasible to fit the entire text into the model's context window. And when we are analyzing tomes, which is what we ideally want our NLP agents to do, we need to extract only the information relevant to the user query to get useful results.
Architecture
PDFTriage has three stages: generating the document structure, querying the LLM so it can use that structure to retrieve the relevant context, and then using that context to answer the user's query.
This is the figure presented in the paper:
Stage 1: Generate Document Structure
The authors consider born-digital PDF documents, i.e., documents created entirely electronically rather than scanned, and use the Adobe Extract API to create a DOM-like structure of the document. This is then converted to a JSON file and given as input to the LLM.
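To make this concrete, here is a rough, illustrative sketch of what such a structure object might look like once parsed into Python. The field names are my own guesses and not the paper's exact schema:

```python
# Illustrative only: a simplified document-structure object of the kind the
# extraction step might produce. Field names are hypothetical, not the
# paper's exact schema.
document_structure = {
    "title": "An Example Research Paper",
    "num_pages": 12,
    "sections": [
        {"title": "Introduction", "pages": [1, 2]},
        {"title": "Method", "pages": [3, 4, 5]},
        {"title": "Results", "pages": [6, 7, 8]},
    ],
    "tables": [{"id": 3, "caption": "Main results", "page": 7}],
    "figures": [{"id": 1, "caption": "System overview", "page": 2}],
    # Raw page text kept alongside the structure so the fetch-style
    # functions in Stage 2 can return actual content.
    "page_texts": ["<text of page 1>", "<text of page 2>", "<and so on>"],
}
```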
Stage 2: LLM Querying
As mentioned above, the input to the LLM is the JSON object representing the document structure, along with the following five functions (a sketch of what they could look like appears below):
fetch_pages
fetch_sections
fetch_table
fetch_figures
retrieve
The paper just mentions “each function allows the PDFTriage system to gather precise information related to the given PDF document, centring around structured textual data in headers, subheaders, figures, tables, and section paragraphs.”
Each function call is issued as a separate query, and the individual results are then combined to form the final context.
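The paper doesn't release the implementations, so here is only a minimal sketch of what one of these functions and its accompanying schema could look like, assuming the Stage 1 structure is available as the Python dict sketched above; the field and parameter names are my own guesses, not the paper's:

```python
# Hypothetical implementation of fetch_pages over the Stage 1 structure.
# The real PDFTriage functions are not released; this is only a sketch.
def fetch_pages(document: dict, start: int, end: int) -> str:
    """Return the concatenated text of pages start..end (1-indexed, inclusive)."""
    pages = document.get("page_texts", [])
    return "\n".join(pages[start - 1:end])

# The matching JSON schema passed to the model so it knows when and how
# to call the function.
fetch_pages_schema = {
    "name": "fetch_pages",
    "description": "Fetch the text of a contiguous range of pages.",
    "parameters": {
        "type": "object",
        "properties": {
            "start": {"type": "integer", "description": "First page (1-indexed)"},
            "end": {"type": "integer", "description": "Last page (inclusive)"},
        },
        "required": ["start", "end"],
    },
}
```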
A small note on how to call the functions
OpenAI has a detailed example here. However, I'll just briefly explain how a function could be called based on the query.
You first define the function in Python, then describe it to the model as a JSON schema with keys like name, description, and parameters.
This schema is then passed as an argument to the OpenAI model call. In the response, the model returns an array of the function calls it needs to make based on the user query. You use this array to call the functions yourself and then run a second query where you input the original query along with the results of the function calls the model requested. A sketch of this two-step flow is below.
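Here is a minimal sketch of that flow using the OpenAI Python client (v1+), reusing the hypothetical fetch_pages and fetch_pages_schema from above; the model name and wiring are illustrative, not what the paper used:

```python
# Minimal sketch of the two-step function-calling flow with the OpenAI
# Python client (v1+). fetch_pages / fetch_pages_schema come from the
# sketch above; model name and wiring are illustrative.
import json
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "Summarize pages 3-7"}]
tools = [{"type": "function", "function": fetch_pages_schema}]

# First call: the model decides which function(s) to call and with what arguments.
first = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
messages.append(first.choices[0].message)

# Execute each requested call ourselves and feed the results back.
for call in first.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    result = fetch_pages(document_structure, **args)
    messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

# Second call: the model answers using the fetched context.
final = client.chat.completions.create(model="gpt-4", messages=messages)
print(final.choices[0].message.content)
```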
Stage 3: User QnA
This is the prompt used to initialize the system:
“You are an expert document question-answering system. You answer questions by finding relevant content in the document and answering questions based on that content. Document: <textual metadata of document>”
The model uses the functions described above, in addition to the context retrieved in Stage 2, to produce the final answer. A rough sketch of how this prompt might be assembled is below.
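Tying the pieces together, here is one way the system prompt could be built from the Stage 1 metadata. The paper does not specify how the metadata is serialized into "<textual metadata of document>", so json.dumps of the structure object is just one reasonable choice:

```python
# Sketch: assembling the Stage 3 system prompt from the Stage 1 metadata.
# The serialization format is an assumption, not taken from the paper.
import json

system_prompt = (
    "You are an expert document question-answering system. You answer "
    "questions by finding relevant content in the document and answering "
    "questions based on that content. "
    f"Document: {json.dumps(document_structure)}"
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Summarize pages 3-7"},
]
# These messages seed the function-calling loop shown in Stage 2.
```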
Conclusion
Even though the approach seems simple enough, I enjoyed reading this paper because it shows how we can make improvements to AI models by literally thinking about what a human would do in the given context. Since we have a mental model when writing anything, like this post for example, it makes sense that the AI model should be able to parse and reference it to get better answers.
That’s it for this issue. I hope you found this article interesting. Until next time!
📖Resources
Let’s connect :)