How We Made That App

Revolutionizing Language Models and Data Processing with LlamaIndex

Episode Summary

In this episode of How We Made That App, host Madhukar Kumar delves into the transformative journey of LlamaIndex with Co-Founder and CEO Jerry Liu. From the humble beginnings of the GPT Index to LlamaIndex's influential role in the data frameworks landscape, the episode explores the groundbreaking impact of retrieval augmented generation (RAG) technology. Jerry Liu discusses the evolution of AI, emphasizing LlamaIndex's pivotal role in reshaping the industry, and invites the audience to contribute to the open-source initiative, showcasing how they collectively shape the future of AI interaction.

Episode Notes

On this episode of How We Made That App, host Madhukar Kumar welcomes Co-Founder and CEO of LlamaIndex, Jerry Liu! Jerry takes us from the humble beginnings of GPT Index to the impactful rise of LlamaIndex, a game-changer in the data frameworks landscape. Prepare to be enthralled by how LlamaIndex is spearheading retrieval augmented generation (RAG) technology, setting a new paradigm for developers to harness private data sources in crafting groundbreaking applications. Moreover, the adoption of LlamaIndex by leading companies underscores its pivotal role in reshaping the AI industry.

In the rapidly evolving world of language model providers, discover the agility of model-agnostic platforms that cater to the ever-changing landscape of AI applications. As Jerry illuminates, the shift from GPT-4 to Claude 3 Opus signifies a broader trend towards efficiency and adaptability. Jerry helps explore the transformation of data processing, from vector databases to the advent of 'live RAG' systems, heralding a new era of real-time, user-facing applications that seamlessly integrate freshly assimilated information. This is a testament to how LlamaIndex is at the forefront of AI's evolution, offering a powerful suite of tools that revolutionize data interaction.

Concluding our exploration, we turn to the orchestration of agents within AI frameworks, a domain teeming with complexity yet brimming with potential. Jerry delves into the multifaceted roles of agents, bridging simple LLM reasoning tasks with sophisticated query decomposition and stateful executions. We reflect on the future of software engineering as agent-oriented architectures redefine the sector and invite our community to contribute to the flourishing open-source initiative. Join the ranks of data enthusiasts and PDF parsing experts who are collectively sculpting the next chapter of AI interaction!


Links

Connect with Jerry

Visit LlamaIndex

Connect with Madhukar

Visit SingleStore

Episode Transcription

Madhukar: [00:00:00] Welcome to this episode of How We Made That App. I'm your host Madhukar Kumar. I started off my career as a developer, then eventually moved to product management, and then finally into marketing. In today's episode, I am extremely excited to welcome Jerry Liu, CEO and co-founder of LlamaIndex, and we are going to talk about retrieval augmented generation as well as applications.

Welcome, Jerry. It's really good to have you here and really good to talk to you again. I think the last time we spoke was at SingleStore Now. That was a few months ago, and it looks like a decade has passed in AI since then. So to our audience, do you mind saying a little bit about what LlamaIndex is and your role at LlamaIndex?

Jerry:

Yeah, sounds great. First, thanks Madhukar for having me on. I'm Jerry, I'm co-founder and CEO of a company called LlamaIndex, and for those of you who don't [00:01:00] know, LlamaIndex is a data framework and platform for LLM application development. And so we have a very popular open source project that's at the forefront of enabling developers to build applications like RAG and agents, LLM applications over their data.

We have, you know, a lot of monthly downloads, 1 to 2 million, used by companies like Uber, which ones can I say? Anyways, so they're listed on the website. There's a bunch of logos. There's Red Hat, there's Adyen, NT Systems, and a bunch of other ones that I'm not authorized to say. So basically, you know, feel free to check it out.

It's at llamaindex.ai, and excited to be here.

Madhukar: Thank you for that. So, when you started this project on GitHub, and last time I was checking, I think it's about 30,000 stars and it has been forked 2,900 times. And I was reading on your website that you've had over 2.8 million downloads in a month. So, you've come a long way.

But when you started, which I believe was early 2023, was it still called LlamaIndex at that time, or was it called Run [00:02:00] Llama or some other name related to Llama? Is that correct?

Jerry: Yeah. So at the time we started the project, it was called GPT Index, and it came out before ChatGPT came out. It started off as a side project back in October of 2022, actually, or early November, when I was hacking around on LLMs and trying to use Davinci-003 on top of some data that I had floating around. And so that was really the inspiration for the first iteration.

Madhukar: And at that time, I'm assuming there was no Llama, which was later released by Meta. So how did you come up with the Llama name? 

Jerry: We came up with this name before Llama did. And I just want more people on the internet to know that.

So that we show up higher on the SEO results. The llama name came from us wanting to rebrand a little bit away from GPT, because GPT just seemed very OpenAI centric. A lot of people think GPT, you know, it's OpenAI something. And so we wanted to pick a cute animal, and that was just one of the options. And then we basically settled on llama as a prefix just because it had the letters L [00:03:00] in the name.

And then we wanted to use that as a prefix for a lot of our other offerings. And so that's why we have LlamaIndex, which is, you know, the company name and also the core project. But we also have LlamaHub, which is the overall center of integrations that contains, you know, data loaders, LLMs, vector stores, basically all third-party integrations in the ecosystem, and then also LlamaCloud and LlamaParse, of course, and other llama-related things, too.

Madhukar: Yeah, and I've been following LlamaIndex, I think, since last February or March. And at that time, when even I was playing around with it, and within SingleStore too, when we were looking at a bunch of Gen AI applications, RAG, or retrieval augmented generation, was not a thing the way it has now become a standard.

What was your thinking? I know personally, I was calling it in-context learning, or even, you know, we were confusing it with fine tuning, which is a totally different thing, and I'd love for you to even talk about that as well. But what was your thinking, like what was the problem you were trying to [00:04:00] solve when you came up with GPT Index, and then how did it evolve into retrieval augmentation or RAG?

Jerry: Yeah, I mean, the overall problem that we tried to solve since the very beginning, and by the way, that we'll continue to solve, even whether or not the name RAG, you know, continues or changes in some form, is to basically enable any developer to build applications on top of their data. So basically harnessing the power of language models, and basically making it very effective for people to build stuff on top of any private sources of knowledge that they have.

Whether that's, you know, unstructured text that you then put into a vector database, whether that's a SQL database, whether that data is hidden behind an API. And so our goal is to basically help create the tools to enable developers to build that type of data stack as well as the application layer code to help them build these types of applications.

I think the history of RAG, yeah, I mean, I think it basically emerged a little bit around mid last year, around March to June of last year. Before then, people called it different things, like in-context learning, or some aspect of just, you know, embedding-based retrieval and [00:05:00] question answering. And honestly, it was just like one of those things where people generally gravitate towards catchy three-letter acronyms, like LLM as a term.

Multimodal models don't yet have an acronym and I imagine they'll get one soon because it's kind of annoying to try to spell everything out. And then RAG just becomes like a catch all term, right, for basically anything related to this overall like kind of framework of question answering over your data.
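For readers following the technical thread, here is a minimal sketch of that "question answering over your data" pattern using the LlamaIndex Python API as commonly documented; the folder path and question are placeholders, and import paths can vary by version.

```python
# Minimal RAG sketch with LlamaIndex. Assumes `pip install llama-index` and an
# OPENAI_API_KEY in the environment; imports follow the post-0.10 package layout.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load private documents (PDFs, text files, etc.) from a local folder -- hypothetical path.
documents = SimpleDirectoryReader("data").load_data()

# Chunk, embed, and index them into an in-memory vector store.
index = VectorStoreIndex.from_documents(documents)

# Ask a question: retrieval pulls the relevant chunks, the LLM synthesizes the answer.
query_engine = index.as_query_engine()
print(query_engine.query("What does the contract say about termination?"))
```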

Madhukar: I know from a LlamaIndex perspective, I love the way it's organized right now, where you have a bunch of libraries to ingest the data, then to store it across multiple data stores, and then retrieve it. And very recently, I believe you also added evals, right? So, when you look at the overall landscape of just retrieval augmented generation, what are some of the applications you see in the market?

Like what are companies building? The reason I ask is, for several months now, a lot of companies and developers have been kind of doing POCs, pilots, and experimenting. But what are some of the things that you've actually [00:06:00] seen go into production, and what are the general overall use cases that you see when people use LlamaIndex?

Jerry: I think given our focus on this whole idea of uncovering insights from data, I think a lot of the use cases you see with LlamaIndex tend to be just the practical stuff, like being able to extract insights from different types of knowledge sources. So this includes PDFs, which I think is probably one of the most popular data sources.

Being able to index like HTML web pages. And so, a big chunk of our use cases include question answering, search, chatbots, and a variety of basic different types of UXs that enable users to find and surface and retrieve information from their data. That's probably by far the most popular use case.

I think, in the world of agents or anything that resembles a bit more like workflow automation, so not just extracting information, but also being able to take actions over it. We do have that too. I think we actually have some pretty good abstractions. I think people, for us, like they tend to use our agents more for the purpose of knowledge extraction and synthesis.

But I think in general, [00:07:00] probably this year, as companies move beyond the POC-like chatbot, they're going to start adding in some basic plugin or workflow functionality that basically allows the LLM to do stuff like function calling and tool calling on different APIs. And that's where I think you'll start to see a little bit more and more of just these agentic capabilities, where this chat interface can not only just surface information, but actually take actions for you, too.

And so, I think, in general, it's probably some sort of conversational assistant. I think that's probably one of the biggest use cases.

Madhukar: So, if I were to break it down into two big buckets, the first one is information synthesis against your own data and then augmenting that with an LLM, and the second one is agentic, where you use it also for knowledge retrieval, but maybe you do something with it as well.

So one thing that I most commonly see is that people start off with just kind of putting their own bot out for customer support, and they might do it internally, and then later on they might expose it externally. Even in information retrieval, do you [00:08:00] see some pattern of what kind of apps are emerging and where most of these apps are being built?

Jerry: I think if part of the question is basically internal versus external facing, I think we're seeing, yeah, as you said, probably the majority of applications start off being somewhat internal facing, because users want to figure out how to use this on top of their own data, like whether it can demonstrate value, before giving it to the customer.

We have seen external-facing applications too, like some of our users basically have built external chatbots powered by LlamaIndex under the hood, for their end users. That said, I think especially for bigger companies, say, financial institutions is like a big one. You want to throw in a lot of your own internal data.

It tends to be like unstructured, semi structured, structured. And you want to be able to give, build some sort of tool that can analyze that data and give you back insights. And that's both like an internal tool and also something that's very document heavy. I think in general, we probably see a good chunk of data sources for RAG tend to be like document heavy workloads.

So, for instance, you have a bucket of files that you want to read over, and this is something that we've actually probably seen in the majority of use cases, just there's some bucket of PDFs, maybe CSVs, docx files, that you just want the reader to ingest. And that's actually something that if you look at out-of-the-box ChatGPT, Claude, all these initial chat interfaces, the base UI layer is always just being able to upload your own files, right?

Because it's just a very, almost like a table stakes UX functionality to enable. I think being able to connect to different providers and commonly used API services, that's starting to become more common. Some of our most common data loaders include Notion, Slack, as well as Google Drive.

Well, Google Drive is basically files, too, so that doesn't really count. But for some of the other commonly used services, we see more and more people hitting these APIs to load data so that they can index it and build with it. I think one thing that we haven't done, but I think a lot of people have tried to do, as in we haven't really invested a lot of effort in education, is code indexing.

So being able to index an entire repository of code, just because I think a lot of the logic to [00:10:00] basically do that requires a lot of kind of custom work, like you need to construct an AST, you need to basically model the relationships between different things, which is definitely possible in LlamaIndex, but tends to be a bit more domain specific.

But that has been something that we've seen. 
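As a rough illustration of hitting one of those API-backed sources, here is a hedged sketch using the Notion reader from LlamaHub; the integration package name, reader arguments, and page ID are assumptions that may differ from the version you have installed.

```python
# Sketch: loading data from an API-backed source (Notion) via a LlamaHub reader.
# Assumes `pip install llama-index llama-index-readers-notion` plus a Notion integration
# token; the reader name/arguments follow the integration docs and may vary by version.
import os

from llama_index.core import VectorStoreIndex
from llama_index.readers.notion import NotionPageReader

reader = NotionPageReader(integration_token=os.environ["NOTION_INTEGRATION_TOKEN"])
documents = reader.load_data(page_ids=["<your-notion-page-id>"])  # placeholder page id

index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("Summarize the onboarding checklist."))
```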

Madhukar: And, you know, the other question I have just around those patterns is, since you've seen what kind of applications are being built, both in open source as well as the enterprise version, which I do want to talk about a little bit more, but do you also see a pattern in what kind of LLMs are being used?

Are they mostly, is it 50 percent open source and 50 percent commercial? Or is it more towards 70, 80 percent OpenAI and the rest is open source? What do you see in terms of usage patterns?

Jerry: It's probably 80 percent OpenAI and/or Azure OpenAI. And then the other 20 percent is probably 10 to 15 percent open source, and then the rest is everything else.

But, you know, I think these things, honestly, I don't know. I feel like these things will [00:11:00] change over time. And I actually don't think there's a lot of stickiness with a given model provider. It's really just that people switch to whatever is going to work the best. And so, you know, if Claude 3 Opus does better than GPT-4, I don't know what the recent usage metrics are, but I'm sure there's a healthy percent of people switching.
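That lack of stickiness shows up in code as roughly a one-line configuration change. A hedged sketch using LlamaIndex's global Settings object; it assumes the separate OpenAI and Anthropic LLM integration packages are installed, and the model names are just illustrative.

```python
# Swapping the underlying model provider without touching the rest of the pipeline.
# Assumes `pip install llama-index-llms-openai llama-index-llms-anthropic` and the
# corresponding API keys; model identifiers are illustrative.
from llama_index.core import Settings
from llama_index.llms.anthropic import Anthropic
from llama_index.llms.openai import OpenAI

# Default to an OpenAI model...
Settings.llm = OpenAI(model="gpt-4")

# ...or switch to Claude 3 Opus by changing one line; query engines built
# afterwards pick up whatever Settings.llm currently points at.
Settings.llm = Anthropic(model="claude-3-opus-20240229")
```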

Madhukar: Yeah, and that's why I like that in LlamaIndex, you're kind of agnostic, so you can choose to swap out the model with the ease of just changing the config. Right, exactly. So, then moving on to the other side of the RAG, which is the data store, you mentioned there's both structured and unstructured data. And one of the interesting use cases, at least for me, and we do this internally as well for marketing, is that talk-to-PDF is kind of now, just like you said, becoming commoditized, especially when the context windows become bigger, so you can just upload your book or whatever, although there are pros and cons of sending a very large context. But the one that's very interesting to me is, let's say you upload your data, which is a CSV file.

And you're not only talking to the data in NLP [00:12:00] and it's giving you insights, but it's also giving you analytics as well. So, that's like a perfect example of both structured and unstructured data, but you're also doing analytics. So, when it comes to data stores, do you see, what's the, again, what's the ratio?

Do you see most of them as vector-only databases, or do you think these are databases like pgvector, you know, basically Postgres with pgvector, or is it two or three databases mostly? What are some of the, again, common patterns over there that you see?

Jerry: Actually, I'd probably think about it a little bit differently, in that people either want to connect to just a very raw data source, because that's what they're familiar with.

When they look at like the ChatGPT UI, you can upload a raw file and just start asking questions over it. And then the question is, what's the stack you need for that? And then people also want to connect to operational data, which is data that's already live within the data stack that they're using.

And then how do you make use of that and get that one to interact with it? Because those two typically [00:13:00] are actually pretty different stacks. We have like a vector, like the first part is basically a lot of the data processing for RAG. So you load in a PDF and then you do some sort of chunking indexing, embedding.

And then that goes into a vector store system, you know, and I know SingleStore supports vector indexing. And then the other part is just text-to-SQL. I mean, sorry, that's like an implementation of this overall idea, which is you connect to a data system via a predefined API. And then you figure out how to interact with that API.

And I think those two are actually both pretty common. I've seen a lot of companies try to make use of their own operational data, and then also companies try to make use of just untapped raw data. I think within the untapped raw data space, PDFs are probably by far the most common file format, followed by probably something like CSVs, Excel sheets, docx files, those types of things.

And then for structured data, or sorry, just like operational data, yeah, that's typically just, yeah, like a SQL database, whatever format it is. And you want to be able to query over it to basically understand insights. 
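The operational-data path Jerry describes is essentially text-to-SQL: hand the LLM the schema and let it write the query. A hedged sketch using LlamaIndex's SQL query engine; the connection string and table name are made up, and class locations may differ slightly by version.

```python
# Text-to-SQL sketch: natural-language questions over an existing operational database.
# Assumes SQLAlchemy plus llama-index; the database, table, and question are hypothetical.
from sqlalchemy import create_engine

from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

engine = create_engine("sqlite:///sales.db")                    # placeholder connection string
sql_database = SQLDatabase(engine, include_tables=["orders"])   # expose only the tables you trust

query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["orders"])
print(query_engine.query("What was total revenue by region last quarter?"))
```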

Madhukar: And over a [00:14:00] period of time, I know that the, especially with whatever we are seeing in AI, the latency for getting the data from your data sources has decreased, right?

So, for example, with some advanced RAG patterns, you can choose to send only some queries to your RAG pipeline; the rest you can send directly to LLMs. And now with Groq and such, even that latency has gone down. So do you see any pattern in, call it live RAG, where your data is born every millisecond and is being vectorized as well as queried in real time, or call it active RAG or whatever?

And two, does that lead to more live Gen AI use cases where, you know, it could be a live video stream and then you're doing both RAG as well as inference at the same time, with less than a second of latency? Have you started to see some use cases like that?

Jerry: Yeah, I [00:15:00] guess I'm still thinking a little bit about what that means specifically.

So far, I think it's mostly use case dependent. I think for a lot of document heavy workloads, like certainly the data changes over time, but maybe a little bit slower. And I think a lot of people are just still stuck in that of I gotta have a lot of documents, how do I like input this so I can get back a response?

I think for live settings, that tends to be for maybe more user facing or like places where the data is like streaming or real time. And we have seen that, but probably in a minority of use cases so far. And like part of that is basically just you are able to load in new data into your storage system and somehow have a good pipeline for incremental updates.

And that's certainly something that we support within LlamaIndex and we want to continue supporting. And then once that data is live within the database, then, you know, whenever you run the RAG pipeline, it will always have access to the latest and freshest data, right? I think that's actually one of the advantages of RAG versus, say, doing pure fine tuning to try to memorize the knowledge, because [00:16:00] any sort of training over any new sources of data will inherently take some time.

Whereas with RAG, all you have to do is just load the data into the vector database, and that can get very fast.
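For the live case, the mechanics boil down to incremental inserts into an index that is already serving queries. A rough, hedged sketch; the seed document, the new record, and the update trigger are all placeholders, and `insert()` is used here as the simple per-document path rather than a full incremental-sync pipeline.

```python
# Keeping an index fresh: insert newly arrived documents into a live index so the
# next RAG query immediately sees them. Contents and trigger are placeholders.
from llama_index.core import Document, VectorStoreIndex

index = VectorStoreIndex.from_documents([Document(text="Seed note: product launched in Q1.")])
query_engine = index.as_query_engine()

def on_new_record(text: str) -> None:
    """Called whenever fresh data arrives (webhook, stream consumer, cron job, etc.)."""
    index.insert(Document(text=text))   # chunk + embed + add to the underlying store

on_new_record("2024-03-14: support ticket volume spiked 40% after the latest release.")
print(query_engine.query("What happened after the latest release?"))
```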

Madhukar: Yeah, and maybe let's talk a little bit more about fine tuning versus RAG. You know, at one point, of course, fine tuning was: you take a key-value pair of your query and your response into a dataset and then load it into your LLM, and then basically ask it to respond that way. So it affects the behavior, per se, versus giving it more knowledge, right? And then, of course, with RAG, you're also giving it knowledge, but it's more in context. Has fine tuning changed? Have you seen any use cases around fine tuning that have evolved from that, where more and more users or companies are using it differently?

Jerry: Honestly, no. I think most fine tuning UXs right now are pretty imperfect. I know people are certainly working on it, but it's just not there yet. Even with OpenAI's fine tuning updates a few weeks or months ago, where [00:17:00] basically, you know, you can fine-tune GPT-3, now GPT-4, with different types of data.

There's just this fundamental UX problem of basically, you're either an AI engineer that just wants to go in and build a product with LLMs, or you're an ML researcher that already knows how to write PyTorch. And so if you're a fine tuning API, you either have to cater to the ML researcher or the AI engineer.

And to be honest, most AI engineers are not going to care about fine tuning if they can just hack together some system initially, that kind of works. And so I think for more AI engineers to do fine tuning, it either has to be such a simple UX that's basically just like brainless, you might as well just do it.

And the cost and latency have to come down. And then also there has to be like guaranteed metrics improvements. Right now it's just unclear. You'd have to take your dataset, format it, you know, and then actually send it to the LLM and then hope that it actually improves the metrics in some way.

And I think that whole process could probably use an improvement right now. 

Madhukar: So let's move on to the enterprise version. You recently launched Llama Cloud. Is that [00:18:00] generally available right now, or is it only by invite? 

Jerry: I think, so good question. So maybe just taking a step back, you know, LlamaIndex, we've been open source for about a year, and we absolutely want to continue supporting the open source community.

I think it's very important to us for a lot of different reasons. One is just, you know, we love just like educating users on just what are some of the new use cases that are emerging? What are some cool techniques? You know, we put out a lot of stuff and we have very, a very talented set of like employees, but also community members that contribute a lot of great content.

And so that stuff is absolutely not going to go away. One thing that motivates the enterprise platform is basically, as we talked to a lot of. Enterprise users are learning certain pain points that they are running into as they try to go to production. And so some of these issues include, you know, you're setting up a RAG pipeline that works fine when you prototype it in a POC, but then actually when you try to productionize it, There are certain issues that pop up.

This includes response quality issues, like retrieval [00:19:00] and generation based issues. This also includes, you know, you have a hard time scaling to new data sources. You're having a hard time getting this data into the right format so you can use it with the LLM. And then also you're having a hard time tuning the model itself.

So after taking a look at a lot of this, we realized, you know, the open source will always be an unopinionated toolkit that anybody can go and use to build their own applications. But what we really want with the cloud offering is something a bit more managed, where, if you're an enterprise developer, we want to help solve that clean data problem for you.

So that, you know, you're able to easily load in your different data sources, connect it to a vector store of your choice. And then we can help make decisions for you so that you don't have to own and maintain that, and you can continue to write your application-layer code. So LlamaCloud, as it stands, is basically a managed parsing and ingestion platform that focuses on getting users clean data to build performant RAG and LLM applications.

I think that's one of the key goals. Llama Cloud [00:20:00] as a whole is in a private preview. So we are working with a few design partners to basically build this out. And, you know, we'll probably open it up a bit more publicly in a few months or so once we have a core set of these features that we're confident that a lot of people will use.

LlamaParse, though, is a very specific piece of LlamaCloud that does one thing in that overall stack of parsing, ingestion, and retrieval, which is parsing. LlamaParse is a very specialized PDF parser that is very good at parsing a document into the right representation. And we can talk about why we think this is useful, but basically, you know, a lot of use cases, like I just mentioned, involve very heavy document workloads and trying to build RAG over those documents.

And our parser is actually basically tailored for data extraction from these documents in a way that like lets you build optimal RAG over these documents. So even for very complex PDFs with a lot [00:21:00] of tables, charts, figures, as well as weird formatting and sections, Our goal is to extract that representation very faithfully so that you can use it with a language model of your choice.

And that's what we designed it for. And it turns out it's been seeing a decent amount of usage. We've hit one to two thousand users within the first one to two weeks. A lot of users are using us. In terms of just the number of pages, you know, there have been a lot of pages processed so far, and there's more to come.

And yeah, I think that's just an initial hook that I think is something that's very useful for people building RAG.

Madhukar:  That's phenomenal. So if I understood you correctly, what I would do is I'll set up my vector store in Llama Cloud, then I would connect my different data sources through the connectors, which I'm assuming is a low code or maybe a code version, and then I can use the Llama Cloud SDK to parse through all my documents that I might have, which is through that parser it goes in and then goes straight to my cloud.

And then using my SDK I can then do the [00:22:00] retrieval and put it out to my application layer. Is that right? 

Jerry: So the only modification I would make is that LlamaParse itself is just a standalone API right now, so you can use it completely independently of the rest of LlamaCloud if you want. On top of the REST API we have a Python client and a JavaScript client as well.

But yeah, it also hooks into the rest of LlamaCloud, which is exactly as you described. You're able to just have a nice managed workflow, where you define your parsing strategy, but also your data source and data sync. And then we can just manage that pipeline and help you run it in production.
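Concretely, standalone usage looks roughly like the sketch below, using the llama-parse Python client; it assumes a LLAMA_CLOUD_API_KEY in the environment, the file name is a placeholder, and argument names may vary slightly by version.

```python
# LlamaParse as a standalone API: parse a complex PDF into markdown, then feed the
# result into a regular LlamaIndex RAG pipeline. Assumes `pip install llama-parse
# llama-index` and a LLAMA_CLOUD_API_KEY in the environment.
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

parser = LlamaParse(result_type="markdown")       # markdown output preserves tables and sections
documents = parser.load_data("10k_filing.pdf")    # hypothetical document

index = VectorStoreIndex.from_documents(documents)
print(index.as_query_engine().query("Summarize the risk factors table."))
```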

Madhukar: And what about debugging and explainability?

For example, if I'm trying to do evals and a lot of the queries are coming in, am I able to go back and see, yes, this was a good match or not a good match, or it was not even close to the ground truth? Do you have the evaluation part of it also as an SDK, or is that in the cloud?

Jerry: Yeah, so that's a great question.

We have pretty comprehensive eval capabilities in the open source, and our main goal on LlamaCloud is to basically tie these metrics to data quality, so that you [00:23:00] basically have a view of how good your data is with respect to any downstream application logic that you're running. On the open source, we have our own evaluation modules.

But there's also an entire, very healthy ecosystem of evaluation and observability providers that we integrate very deeply with. And we have first-class support for companies like, you know, we have OpenLLMetry, like different, I'm forgetting, we have five to eight to ten different integrations on the observability page.

But we have a long list of different partners, from good eval toolkits, TruLens from TruEra is another one, Ragas is another one, to observability.
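As a rough illustration of the open-source evaluation modules Jerry mentions, here is a hedged, self-contained sketch; the evaluator class names follow the LlamaIndex evaluation docs, the toy document and query are made up, and the evaluators are assumed to fall back to the globally configured LLM as the judge.

```python
# Sketch of LLM-as-judge evaluation with LlamaIndex's built-in evaluators:
# faithfulness (is the answer grounded in the retrieved context?) and
# relevancy (does the answer actually address the question?).
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

# Tiny toy corpus so the example stands on its own.
index = VectorStoreIndex.from_documents(
    [Document(text="The contract may be terminated with 30 days written notice.")]
)
query_engine = index.as_query_engine()

query = "What does the contract say about termination?"
response = query_engine.query(query)

faithfulness = FaithfulnessEvaluator().evaluate_response(response=response)
relevancy = RelevancyEvaluator().evaluate_response(query=query, response=response)
print("faithful:", faithfulness.passing, "| relevant:", relevancy.passing)
```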

Madhukar: Both. So let's come back to agents for a second, because that's one of those things that's very intriguing to me as well. So when I think about agents, I think OpenAI's version is the Assistants API, right? Or maybe in AutoGen, which is an orchestration framework, you have a definition of an agent as well.

And typically in those frameworks, there are three different [00:24:00] things to it. There's the knowledge, which is basically all your structured and unstructured data. Then the set of tools or function calling. And then third is, of course, access to the LLM. In LlamaIndex, is that how you define an agent as well?

Or is it a completely different construct?

Jerry: Yeah, we have a lot of talks about what it means to be an agent, and what the layers of agents are. I typically think of agents as layers of different capabilities, and the full thing, which contains all of these things, we call, you know, an agent within LlamaIndex, and there are some different implementations of that.

But really, if you think about layers of what agentic capabilities mean, it's basically just using the LLM to reason about things. That's really it. And then you can get arbitrarily complex, or not so, with how you use the LLM to reason about things. But the key idea is just prompts, LLMs, and then you use it to do interesting decision making.

So some examples include just like from a very basic first principles level. Being able to pick a set of choices given a query. So being able to pick multiple choices given a [00:25:00] query. That's typically what routing is. So you take in a question and then, or a task, and then you route it to an underlying set of choices, right?

Whether it's one or more than one. That's agentic because you're using the LLM for decision making. Another is query decomposition. So given a question, break it down into sub-questions. This starts getting into both chain of thought as well as query planning. So given something, try to plan out a map of how that question can be decomposed so that you can, you know, execute each one independently and synthesize an answer.

And then you start getting into tool calling, right? So tool calling is definitely agentic because you're using an LLM to infer the parameters of some API, whatever that API might be. to try to interact with some data system to get back a response. That API could be, you know, your standard REST API, where you infer some parameters to hit that REST API and you get back an answer.

It could be like the interface of a SQL database or vector store. So I think we actually chatted about this, Madhukar, I don't know if you remember, with SingleStore, for instance, like trying to get [00:26:00] the LLM to write SQL with the right syntax. If, for instance, you could actually get it to do both semantic search and structured query, that's also an example of tool calling, because you're trying to get the LLM to interact with the SQL database.

And then there are plenty of other examples too. The key idea is you try to get the LLM to interface with the external environment and you get back something, right? So all this stuff you can do in a one-shot manner, as in you call it once and you get back a response.

That certainly is agentic. But you could also do it in a repeated manner, too. You could, for instance, keep iterating on something until it is complete. You can continue to call tools until, you know, you feel like you got the right answer. You could also just set a fixed loop counter.

And so, that's where you start getting into the full agent definition, which is you know, you have LLMs that do decision making and tool calling, and so typically, if you just take a look at a standard agent implementation, it's some sort of query decomposition plus tool use. And then you make that loop a little bit, so you run it multiple times.

And then, by running it [00:27:00] multiple times, that also means that you need to make this overall thing stateful, as opposed to stateless, so you have some way of tracking state throughout this whole execution run. And this includes conversation memory, this includes just using a dictionary. But basically, it's some way of tracking state.

And then you complete execution, right? And then you get back a response. And so that actually is a roughly general interface that we have a base abstraction for. And we started implementing a lot of papers according to this interface. So there's ReAct, which is probably the most standard agent implementation.

It's like the most classic one. Also, a lot of LLMs, more and more of them, are supporting function calling nowadays. So under the hood, within the LLM, the API already gives you the ability to just specify a set of tools so that the LLM API can decide to call tools for you. So it's actually just a really nice abstraction, instead of the user having to manually prompt the LLM to coerce it.

A lot of these LLM providers just have the ability for you to specify functions under the hood. And if you just do a while loop over it, that's basically an agent, right? Because you just do a while loop until that function calling process is done. And that's basically, honestly, what the OpenAI Assistants agent is like.

And then if you go into some of the more recent agent papers, you can start doing things beyond just next-step chain of thought at every stage. Instead of just reasoning about what you're going to do next, reason about an entire map of what you're going to do, roll out different scenarios, get the value functions of each of them, and then make the best decision.

And so you can get pretty complicated with the actual reasoning process, that which then feeds into tool use and everything else. 
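The "while loop over function calling" idea translates almost directly into code. Here is a hedged sketch against the OpenAI chat completions API (openai Python SDK, version 1.x); the single tool is a toy stand-in for something like a vector-store or SQL lookup, and the model name is illustrative.

```python
# A minimal agent: loop over LLM function calling until the model stops requesting tools.
# Assumes `pip install openai` (>=1.0) and OPENAI_API_KEY; the lone tool is a toy stand-in.
import json
from openai import OpenAI

client = OpenAI()

def search_docs(query: str) -> str:
    """Toy tool: in a real app this would hit a vector database, SQL engine, or API."""
    return "Refunds are processed within 5 business days."

tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search internal documentation.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "How long do refunds take?"}]
while True:
    reply = client.chat.completions.create(model="gpt-4", messages=messages, tools=tools)
    msg = reply.choices[0].message
    if not msg.tool_calls:                  # no more tool requests: the loop (agent) is done
        print(msg.content)
        break
    messages.append(msg)                    # track state: the assistant's tool request
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = search_docs(**args)        # execute the requested tool
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```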

Madhukar: Yeah, and you kind of alluded to the orchestration piece earlier saying you are also working on the workflows for letting the different agents talk to each other and collaborate on a common goal.

Is this similar to AutoGen Studio or CrewAI, or is the way you're looking at it from a LlamaIndex perspective completely different? Or similar?

Jerry: Honestly, yeah, probably not that different. I mean, I think these are all great projects. [00:29:00] Obviously, our goal isn't necessarily to compete with all of these.

But we do want to do our own agents. And so far, honestly, we haven't put a ton of focus on multi agent stuff. But we probably will soon. I think so far, we've primarily focused on agents as a tool to help users get enhanced knowledge synthesis, as you mentioned in the beginning, out of their data. And typically, you can solve that with a single agent or some sort of hierarchy of agents of your documents.

And but for something that's kind of more sophisticated with you know, different personalized agents, each with their own memory that can autonomously act and do things. We haven't really done that yet, but I think we are going to allocate some engineering cycles and make the engineering abstractions, like the agent abstractions, more prominent, and also come out with a lot more resources on that.

Madhukar: Yeah, and stepping back, I'm very interested to hear your thoughts on general software engineering. Where is that headed? The way I was thinking about it was, we went from monolithic, you know, C modules and so on, to object-oriented, and then [00:30:00] finally to microservices, which is packaging all of these. Do you think agent-oriented could be the future?

Or do you think it's just one of the many ways that we build today and how we might build tomorrow as well?

Jerry: Yeah, I mean, I think that's very interesting. I don't really have a thesis on whether we'd stop writing, like, Python. I feel like we'd still just write Python. My take is, for a lot of these new pieces of software, you don't really need to invent a new programming paradigm.

You can just continue to write, you know, and continue to use libraries as is, but just figure out how to compose models in interesting ways. But I do think in terms of the higher level architectures, I think that's very interesting to think about. I think one thing that we've written about is that the modern data stack that's LLM powered is going to look different.

I mean, just by building a RAG pipeline, you're inventing a new data stack as opposed to using an existing one. And especially as we believe that more and more data sources will be fed to LLMs for decision making and synthesis. You're going to start just [00:31:00] this whole stack of you know, injection, vector databases, like transformations to try to get the LLM to understand stuff.

That's just going to be a thing that emerges and becomes more prominent, right? Versus just directly like transforming stuff into your data warehouse. The other piece here is the like agent services. I think in terms of just very practically, like in terms of API design, one thing we haven't really done a lot of is really be thoughtful about what API design looks like.

when LLMs interact with APIs versus humans. A lot of the ways that we think about designing a contract between client and server are based on a human understanding how to fill it in with the right parameters, or through an algorithm. But if you want to make things friendly for an LLM to interact with, whether it's a search engine, whether it's trying to do authentication, I think that actually looks pretty different.

And I think more and more services will emerge to enable LLMs to basically traverse the internet and interact with different services. And best practices will emerge there, because I think having that is pretty important to create these [00:32:00] multi-agent networks that can operate effectively and are also fault tolerant.

Madhukar: Yeah, absolutely. So, you know, coming back to agents and RAG, one thing I've been thinking about is, if agents can do function calling and have access to LLMs, then let's say I create an agent that can talk to a database and maybe another agent that can talk to a vector database. So, in terms of RAG, do you see that evolving as well towards something like this, where maybe, quote unquote, it's hard to say, whatever we have been doing now could become the traditional way of doing RAG, and RAG itself changes in the next couple of months because of agents and other things that are about to come out?

Jerry: I mean, absolutely. I think right now RAG really isn't using any of the capabilities that the LLM has to offer beyond synthesis and generation. So it's actually quite limited in that regard. I think more people, I mean, we have a bunch of these abstractions, like you [00:33:00] probably want to add some sort of agentic layer on top of any sort of vector search capability that you're offering, just to give the RAG pipeline, if you will, advanced query understanding.

Right? Even if you're just doing search. And actually, we're starting to see that in a lot of our users, too. They're building some sort of agentic interface with a vector DB as a tool, right? And a lot of that is powered by LlamaIndex.

I don't think it should stay at, you know, just top-k retrieval plus synthesis. More people are doing the more interesting things. And I think it will continue to evolve and change over time.

Madhukar: And do you think, I know nobody can predict the future, but do you think at any point either OpenAI or any LLM provider will directly add connectors within the large language model to the different data stores?

Or do you think they probably will never do that because they're not in that business?

Jerry: I mean, I think they've done parts of it. I think what they'll certainly do is start offering end-to-end RAG APIs. I mean, they already [00:34:00] are doing that. I feel like Anthropic, OpenAI, let's see, most LLM providers actually have some form where you can just upload some set of documents and basically treat it as an API so you can ask questions over it.

One of the current challenges with any of those things is that if it's too black box, users run into issues like with like response quality, and then they realize they have to resort to using a framework anyways. And then the second thing is that I think given the competitive landscape as we're discussing, right, There's new models coming out every day, users want flexibility, options, and being able to optimize for specific use cases and needs.

And so actually tying like a RAG implementation to the model provider itself seems somewhat suboptimal in many cases, and users want the ability to choose between different LLMs. I think one thing that is very interesting, though, is that by owning both vector search capability along with the model, you can do more interesting native integrated architectures where you know, you basically bake in vector search into the model, right?

Instead of the developer having to stitch together two disparate [00:35:00] pieces together. Yeah, I mean, I think that part is probably going to emerge. But, I do think, you know, regardless of what models come out with these types of like new integrated architectures, we still have a bet that there's just going to be new use cases you can build for developers, right?

For developers to build on top of these types of models, to basically connect to different data sources to build interesting applications. I think these model providers will build these types of architectures. They probably won't build like 150 connectors to all the different data sources. Just because that's that like in itself is honestly a pretty hard business, right?

But and you don't really need to do that, but they are going to offer higher level out of the box settings for stuff like RAG. 

Madhukar: Yeah, and actually, come to think of it, even OpenAI has connectors today into some of the vector stores, including SingleStore. But yeah, I've not seen people switch over to just using OpenAI for all of their RAG pipeline as well.

In terms of, you know, applications that you've seen [00:36:00] with or without RAG, what are some of your favorites, some of the creative apps that are being built using GenAI that you really like?

Jerry: Yeah, I mean, I think I have, obviously, a company bias, which is the boring stuff that's useful.

Sorry, I shouldn't say boring. I think it's very interesting. But I think our current obsession is over like complicated PDFs, right? Just like, how do you solve that problem? You have a thousand of these. A lot of these PDFs are just really hairy. They have diagrams, they have like pictures and stuff.

And so we're working with a few enterprise design partners, basically, to actually use LlamaParse with the rest of the LlamaIndex abstractions, to really understand the content in these documents. And I think that's something that is both very practical and something that personally appeals to us.

I think for some of these other, just more general applications, honestly, thinking more broadly, I mean, I think coding assistants are interesting. We haven't spent a ton of time thinking about that. I think any sort of NPC-based world where you're basically a user and you have some sort of [00:37:00] conversation state, and you can basically create these mass simulations of an agent and you actually learn stuff over time.

Right. And you basically kind of can try to write new things as opposed to just regurgitating existing things. Basically any of that, like the consumer facing, like multi agent world, I find very interesting. It's not something that we really do right now, but I think that's, that part is very interesting to us.

Madhukar: All right, this is super interesting and I could go on forever, but I know we are coming up on time, so I'll have two more questions, Jerry, and then we can close it off. So first and foremost, what I wanted to ask you is, in terms of a new developer looking to build something today, with or without RAG, what are some of the best practices that you yourself have learned in the last 12 months that you would like a new developer to know before they start to go build their own application?

Jerry: Yeah, I mean, I think our docs are actually organized towards this, to try to prioritize the simple things first. I would probably, one, just [00:38:00] eliminate complexity by picking the best model that's within your budget, and two, you know, just pick the best thing and then try following the basic RAG setups.

Try the simple stuff first, and if that doesn't work well try the more advanced stuff. The reason is just because you can get arbitrarily deep on the advanced stuff. We have like hundreds of guides showing you how to do very fine grained like tuning of like different strategies and techniques. But often times, a basic RAG stack just looks like the following.

You have a question. You want to do some sort of like query rewriting or agentic understanding on top of that. You do retrieval from a vector database and these days, you know, a common thing that people do instead of just dense retrieval is some sort of like hybrid search. So you do both like keyword filtering as well as embedding based lookup and you combine the two.

Afterwards you do some sort of re-ranking. So this is actually a really nice second step because you can re-rank the context to really determine which ones are the most relevant to the query. And then you get back an answer. That's honestly a relatively basic stack. And [00:39:00] then from there, if that doesn't work, I would try to precisely identify the pain points that you're facing.

Because depending on the pain point, there's a set of solutions that you might want to take. 
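That basic stack maps onto LlamaIndex roughly as in the sketch below; it is hedged in that the reranker needs the optional sentence-transformers dependency, the model name and paths are illustrative, and hybrid query mode only applies if your vector store actually supports it.

```python
# Basic "advanced RAG" stack: retrieve a generous top-k, then rerank down to the few
# chunks that matter before synthesis. Assumes `pip install llama-index sentence-transformers`.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank

documents = SimpleDirectoryReader("data").load_data()     # hypothetical folder of PDFs/docs
index = VectorStoreIndex.from_documents(documents)

reranker = SentenceTransformerRerank(model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=3)

query_engine = index.as_query_engine(
    similarity_top_k=10,                 # cast a wide net on first-stage retrieval
    node_postprocessors=[reranker],      # second-stage rerank to the most relevant chunks
    # vector_store_query_mode="hybrid",  # enable only if your vector store supports hybrid search
)
print(query_engine.query("What were the key findings in the Q3 report?"))
```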

Madhukar: So, you know, just to close this off then, Jerry, again, thank you very much for doing this. Really appreciate it. You've been a great partner and we continue to look forward together as well. Where can people find you if they have a question or they want to contribute?

Jerry:

We have an active Discord, we have a lot of members in there, and then we have GitHub issues as well, so we actually monitor both the Discord and GitHub. So if you're checking out the open source, please feel free to check out, you know, llamaindex.ai or the docs page. And then the other piece is, if you're interested, you know, again, because we've been talking about LlamaParse and LlamaCloud, if you're interested in really helping to solve some of those data connection and ingestion issues, come talk to us.

If you're interested in parsing PDFs, please feel free to try LlamaParse.

Madhukar: Awesome. Thank you so much, Jerry.

Jerry: Yeah. Thanks, Madhukar.