How We Made That App

Data Dreams and AI Realities with Premal Shah

Episode Summary

In this in-depth conversation, host Madhukar Kumar welcomes Premal Shah, the Co-Founder and Head of Engineering at 6sense! Premal discusses the trajectory and technological evolution of his company, giving fascinating insights into the world of data architecture, deployment processes, machine learning, and AI.

Episode Notes

In this engaging episode, host Madhukar Kumar dives deep into the world of data architecture, deployment processes, machine learning, and AI with special guest Premal Shah, the Co-Founder and Head of Engineering at 6sense. Join them as Premal traces the technological evolution of 6sense, from the early use of FTP to the current focus on streamlining features like GitHub Copilot and enhancing customer interactions with GenAI.

Discover the journey through the adoption of Hive and Spark for big data processing, the implementation of microservice architecture, and massive-scale containerization. Learn about the team's cutting-edge projects and how they prioritize product development based on data value considerations.

Premal also shares valuable advice for budding engineers looking to enter the field. Whether you're a tech enthusiast or an aspiring engineer, this episode provides fascinating insights into the ever-evolving landscape of technology!

Timestamps

(00:23) Premal’s Background and Journey into Engineering

(06:37) Introduction to 6sense: The Company and Its Mission

(09:15) The Evolution of 6sense: From Idea to Reality

(13:07) The Technical Aspects: Data Management and Infrastructure

(18:03) Shifting to a Microservice-Focused World

(31:16) Challenges of Data Management and Scaling

(38:26) Deployment Strategies in Large-Scale Systems

(47:49) The Impact of Generative AI on Development and Deployment

(55:18) The Future of AI in Engineering

(01:01:07) Quick Hits

Links

Connect with Premal

Visit 6sense

Connect with Madhukar

Visit SingleStore

Episode Transcription


Madhukar: [00:00:00] Welcome to this episode of How We Made That App. I'm your host, Madhukar Kumar. I started off my career as a developer, eventually got into product management, and then finally into marketing. Today I have with me Premal Shah. He's the co-founder as well as the head of engineering for a company called 6sense, a company that helps marketers with better data and better segmentation.

Welcome to the show, Premal. I know we've been talking quite a lot, and you're one of my favorite customers, one of our favorite customers. But I always look forward to the conversations, because I just love to learn about, you know, what your company is doing, what your engineering team is doing, given the background.

So I wanted to start off with a little bit: tell our audience your background. How did you get into engineering? Where did you start? How did you eventually get to [00:01:00] 6sense?

Premal: Yeah, so I was born in India, in Bombay, and my dad had a business helping the banks do a lot of the reconciliation of their spend, and he had a lot of computers. This is like in the '80s. So I got into, you know, just playing around and, you know, building computers early on, and that passion kept growing into software engineering.

You know, I learned all the different languages and eventually went to school for computer science in India. And then I moved to the U.S. I was doing a bit of, you know, computer networking and computer science, and eventually found my passion building software and websites and large-scale systems, and, yeah, the rest is history.

Madhukar: I was with some students last week at UC Santa Barbara. I was just blown away by the kind of stuff that they're doing now in the university. I [00:02:00] remember when I did my master's in software engineering, we were just told, okay, here are the big concepts, right? But we didn't really get into the hands on programming till we were in like a Java 101 class.

And then we built out a calculator. And today I look at kids and they are building stuff like this team I was working with: they have this racing car, and they have these sensors, and they're taking the data from those sensors using MQTT, putting that data into our database, SingleStore, and then building applications.

What was your experience like in the engineering school? Did you get to see a lot of hands on stuff, or was it mostly about the principles? I remember even the database class that I had was purely just principles and algorithms. It had nothing to do with real databases. 

Premal: We actually got a lot of practical experience [00:03:00] in school, through the projects.

There was a whole web development class where we actually had to design and have a running website with a database attached to it. And in fact, even in India, when we were in school, for our final project we actually built a whole appointment system, properly working, you know, in ASP.NET or whatever it was at that point, but ASP. And then in database classes, we did spatial databases, hierarchical databases. We had to build a whole application with Google Maps and find the nearest restaurant and stuff like that. So it was definitely pretty interesting, and you would get a lot of practical experience when you find your own bugs and solve real problems.

Madhukar: What was the database at that time? I remember when I started working, at my first job, it was a SQL Server database, which was commercial. But I don't remember, for me at least, that there was an open-source database like [00:04:00] MySQL at that time. Do you remember where you started off?

Premal: Yeah, I think we used to play around with MySQL.

I think when you're in college, especially in the US, you probably get licenses to Oracle, like, you know, the skinny versions and whatever features they have. If my memory serves, we weren't too much into Postgres, but yeah.

Madhukar: I remember even with Oracle, they used to come on CDs.

And first you install Red Hat, the open-source version, then you install the database, and then when you wrote the application that picked up the data and showed it in rows and columns, it was, like, such a magical experience. How's that changed? What do you feel about it now? If you were to start over and do things all over again today, what would you choose to start off with?

Premal: Yeah, I mean, I really love MySQL. It's just, you know, easy to get started locally, and there are so many commercial options, and [00:05:00] even AWS has so much support for it. You start off with that, especially when you're doing more of a small-size transactional database, you know, doing your POCs. A lot of people love Postgres; my flavor is MySQL. Start off with a Python-based web framework, or, you know, if you're doing something very API-heavy, then maybe something like Dropwizard in Java. And, yeah, I mean, start with, you know, simple HTML stuff and then make it more complex as time goes.

I've spent a lot of my career, like, optimizing things in MySQL, and I'm really comfortable doing that. And obviously, when you go to the very large parallel databases, that's where you have to do the right research as to what you can get your hands on as fast as possible.

Madhukar: Yeah, I remember with MySQL.

I used to work for a company called Optimost. It was acquired by Interwoven, where I worked. And Optimost was one of the original web analytics companies. [00:06:00] And we used to get all that data into MySQL. And there was a team of five people in New York, if I remember correctly, who were just there to optimize MySQL. Because as it got bigger and bigger, you know, you had to do partitions, you had to do sharding, then you had to do clustering, then you had to do DRM. I'm talking about at least 14, 15 years ago, and it was like a massive rocket-science operation of trying to manage the data and make sure you never lose it, but also run analytics.

But before we go further down that path, tell me, or tell our audience: what is 6sense? That's the company that you co-founded. What does it do? Who do you sell it to? What is it used for?

Premal: We started in 2013. The original mission was, you know, predictive analytics for B2B sales and marketing.

And the problem we were [00:07:00] solving was allowing people, you know, who are selling, and people who are marketing, to understand who is actually interested in buying the product at that time. So whenever you are doing outreach, it is not cold. You know, somebody is researching your product, somebody wants to buy something that you are selling, and that helps you cut down on the time, effort, money, etc., to open pipeline. And eventually, you know, if you have a great product, you're going to make the sale. So that was the whole premise behind the idea. Our fundamentals have not changed; only the way we deliver the product, and the kinds of activation channels that we provide, has changed over the years. You know, we help you do advertising.

We help you do email, we help you hyper-personalize email, we help you do, you know, social channels like LinkedIn, Google, Meta, and we help the salespeople really dig into [00:08:00] what their accounts have been doing, so that whenever they're having a conversation, they can tailor it correctly and figure out exactly, you know: are the prospects looking at their competitors, how many times have they been to my website, what kind of high-value pages are they seeing, what other research are they doing across the internet.

So we are now a big intent data provider, where we get millions of website visits from different high-value publishers every day, which we collate and show in a single pane of glass to our customers, whether you're a marketer or a seller.

Madhukar: That scale is just mind-boggling, at least to me, because at SingleStore, we use 6sense as well.

It's one of our favorite products, and we use it to, you know, basically do our go-to-market. So we figure out our ideal customer profile, then we use 6sense to enrich that data, and then we have very targeted campaigns going. [00:09:00] But tell me, or tell our audience, a little bit about how you started the app.

Like, when you thought about the idea, I'm assuming you were sitting in front of the computer and thinking, okay, what database should I have? What should be the architecture? And I'm sure at that time there was no Kubernetes, there was no cloud. Tell me a little bit about your thinking process.

Premal: Yeah, we were very fortunate to have signed up one of the Fortune 500 companies in the very beginning. So we had, you know, a very decent scale of data coming into the system on a daily basis.

Madhukar: And were you collecting the data directly, or through a third party? How was the data coming in?

Premal: So in the beginning, you know, we had an FTP server.

The customer would constantly, or on an hourly basis, drop their website visitor logs to us. And, you know, think about the time in 2006, '07, '08, when Hadoop really came up to [00:10:00] say, hey, you are collecting so much data that you are not able to process it in a single-threaded, you know, environment.

Now this is a parallel processing system. So we basically built a platform that would get all of our customers' data on a daily or hourly basis, process it every day, and give them insights on top of it. We had some experience with Hadoop and Hive at that time.

So we used that platform as our big data platform, and then we used MySQL as our metadata layer, to store things like who is the customer, what products are there, you know, who are the users, etc. So there was a clear separation of small data and big data in the platform, and then we had a bunch of data processing frameworks set up.

Initially, we started off with, you know, everybody wants to run cron jobs, right? So, yeah, we had a system [00:11:00] to run long-running jobs and things in a particular sequence, and then we have a concept of a DAG, which is a directed acyclic graph.

So when we get our customers' data, we have to, you know, process their web data, their CRM data, their marketing automation data, combine it, and then derive some insights. So internally there is a graph created from it, and initially I think we used one of those Mesos frameworks to build it. And then we actually built an in-house framework, like an Airflow.
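The in-house framework itself is 6sense's own, but the core idea Premal describes, running each per-customer job only after the jobs it depends on have finished, is plain topological ordering. A minimal sketch using only Python's standard library; the job names are invented for illustration:

```python
from graphlib import TopologicalSorter

# Hypothetical per-customer jobs mapped to their upstream dependencies.
# Each job may run only after every job it depends on has finished.
dag = {
    "ingest_web":       [],
    "ingest_crm":       [],
    "ingest_marketing": [],
    "combine":          ["ingest_web", "ingest_crm", "ingest_marketing"],
    "derive_insights":  ["combine"],
}

def run_dag(dag):
    """Execute jobs in a valid topological order and return the run log."""
    order = list(TopologicalSorter(dag).static_order())
    log = []
    for job in order:
        # In the real system this would launch a container on Kubernetes;
        # here we just record the execution order.
        log.append(job)
    return log

log = run_dag(dag)
# Every job appears after all of its dependencies.
assert log.index("combine") > log.index("ingest_web")
assert log[-1] == "derive_insights"
```

A production orchestrator like Airflow adds retries, scheduling, and parallel execution on top of exactly this ordering step.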

And obviously over the years, right, as you said, Mesos died, but Kubernetes came up as, you know, the de facto container orchestration system, so we migrated to it several years ago. Our [00:12:00] web services are running on it. Our whole DAG framework is running on it. We have hundreds of cron jobs and long-running jobs, so we run about 10,000 jobs, 2 million containers a day, doing all the processing across our customers. As a customer is onboarded, the DAG gets created, all their data gets collected on a daily basis and processed; everybody gets their own mini DAG that gets run through the system. So we have a lot of data and a lot of processing now, and, you know, we're totally on AWS, leveraging most of their infrastructure and running our Hadoop frameworks and our Kubernetes clusters on top of it.

Madhukar: When you say 2 million containers, I cannot even imagine how many EC2s or how many virtual machines you're running. But let's go back to the original, because I kind of love how all of this started and where it is today at this massive scale. [00:13:00]

So the whole idea was: if I'm a customer, take my data. And I remember FTP. Boy, there used to be Cyberduck, I remember, on Mac, which was a much later product, but it allowed you to, you know, FTP your data. And that's how I would share files, even with myself, for a very long time. And then there was FileZilla and whatnot.

It's interesting that now, with the likes of S3, but also a bunch of other new technologies that came about, we didn't realize that FTP actually died. I don't know if anybody still uses FTP, but I remember there was a time where that was the only way that you could send files, very large files, over to different companies.

And then there are, of course, SSL and other secure ways of sending files. So, in your case, companies would FTP their data periodically, I'm assuming with a lot of manual work. And then the cron jobs [00:14:00] would process that data, and then the output would also be data that goes back into an FTP folder for the customers.

Is that how it started off?

Premal: Yeah, so as you said, in the very beginning, you know, the large companies used to just dump their data on a regular basis. But then we built our web tag collection system. So now we have a JavaScript tag sitting on their website, data is continuously streaming in, secure and all, and then we do all that processing, right?

So we have to really understand, okay, the day is done, now we need to process yesterday's data. Some of it we do in real time, some we do in batch, and then there are multiple ways of delivering, right? We have our own product, you can log in, you can see the data; we push the intelligence back into your CRM, where your sellers are living; we push it back to your marketing automation, so you can run campaigns off it.

There are certain customers who want to do more [00:15:00] BI, you know, analytics on top of the data, so we push some raw data back to their FTP, to their secure servers. Depending on what a customer wants, there are several delivery mechanisms for the data.

Madhukar: What was it like to go from your own servers to the cloud, and then eventually to containers and Kubernetes?

Like, how long did it take you to do that migration?

Premal: I mean, pretty much from the beginning, the only non-cloud stuff we were doing was local development. You know, this is like 2013. We each bought or built a computer sitting under the desk. So we had a local network setup, where everybody's computer was part of the Hadoop distributed file system for testing locally.

But pretty much everything we were actually running was in the AWS cloud from the very beginning. We were in Y Combinator, so we got a lot of credits to get started with [00:16:00] AWS; you know, we did not have to spend out of the company's pocket. And obviously it was just setting up EC2 instances and running your scripts on top of them, all the old-school DevOps work that I used to do, and then, you know, Ansible automation, and now the new ways of setting up servers. Obviously Kubernetes has made things really easy to set up.

We have many of those clusters running. The transformation to Kubernetes was about, you know, five years ago, when it was still early, so we had to learn a lot, and, you know, I'm sure their APIs are still changing. But coming from that Mesos world, knowing that, hey, this system works, going to a much more stable, widely adopted container orchestration system was always on our mind.

And, you know, we have made that leap. Actually, we don't have Mesos running anymore. We have many clusters of Kubernetes running, and we are investing a lot in, you know, scaling up and down and saving money between [00:17:00] spot and on-demand. So now that's what we invest time in, instead of worrying about how to run containers.

Madhukar: Got it. And you know, eventually, of course, I'd love to hear about how things are changing and what you are doing to prepare your team for, you know, whatever's happening with GenAI and stuff. But in general, did you go from a monolithic architecture to microservices-based, or were you always kind of disaggregated and container-friendly to begin with?

Premal: Obviously, in the very beginning, we had two or three services only, right? And we had stuff running on EC2, and we had some deployment scripts, which would package the artifacts, ship them, and deploy them. But pretty soon we realized that, hey, the world is moving to microservices; we need to make it easy for our developers to build and deploy stuff in a microservice environment.

So, you know, we started investing in [00:18:00] containerization and figuring out how we could deploy it. And at that same time Kubernetes was coming in, so using Docker and Kubernetes, we were able to blow up our monolith into microservices, a lot of them, and now each team is responsible for their own service: scaling and managing and building and deploying the service.

So the confluence of technologies, and, you know, what you could foresee as being challenges, really helped in making the transition to microservices.

Madhukar: That's one thing I'd love to understand. I remember when I was much closer to product development, at least when, you know, we were shipping products outside of the cloud, there used to be a back-end team, and then a front-end team, and of course the front-end team had the design team, and so on.

How do you, first of all, how big is the engineering team? Are you able to share?

Premal: We have about 200 engineers.

Madhukar: 200? [00:19:00] So how do you organize a team like that? Is it around feature sets? Is it around what exists in the product today? Like, how do you think about organizing your team so that they're all working like microservices, right? Where they do their own modular stuff, but they all eventually connect at some point. How do you think about the organization itself?

Premal: So we largely divide the teams into platform teams and application teams. Platform teams are responsible for building the common frameworks that, you know, everybody would use: doing the build-and-release stuff, building the API frameworks and security frameworks and logging frameworks. All that stuff happens on the platform layer, and they are obviously taking requirements from the developers who are working on the customer-facing applications, to make sure that they are not repeating the cycle over and over again.

So then they can [00:20:00] just start a service and they get all the scaffolding built in. And the vertical teams, or the application teams, are divided into the, you know, various products that we sell. We are largely persona-based, so we have a marketing-focused product, a sales-focused product, an operations-focused product, and we split teams into those pillars right now.

Madhukar: And do they also have domain experience, or do they have experience on a certain part of the product?

Premal: Yeah, so, obviously before joining 6sense, a lot of people don't have any sales or marketing experience as developers. But, you know, once they come in, they are mostly working on their area.

Obviously, we try to rotate people too, so that they are getting some variety across products and services. Even within marketing, we have sub-areas: there's advertising, there's email, there's, you know, reporting and analytics. So people [00:21:00] spend a lot of time going deep into their area before they move on to something else.

Madhukar: And typically, when you have a platform team, are they the ones also responsible for, let's say, the data aspect of it? Because I'm assuming the data aspect is pretty big. Are you able to share what size of data gets generated or processed on a daily or monthly basis?

Premal: You know, I would approximate it to probably a few hundred, or maybe around a hundred, terabytes of data that we, you know, move around on a daily basis. We probably get a few terabytes coming in, and then we process it and, you know, derive different signals from it. So that blows it up.

So, yeah, it's a lot of data. Obviously, we don't keep everything, but we generate a lot and then we discard a bunch of it.

Madhukar: And is that also disaggregated across the different teams, or is it the platform team that says, no, the data [00:22:00] aspect is entirely on the platform side, and here are a bunch of governance and access tools?

Or am I thinking about it differently?

Premal: Yeah, so the platform team, you know, including the big data team, is responsible for making sure we have a healthy environment that people can run their jobs on, right? And then the individual application teams, or even people from some platform teams, can write their jobs and make them part of the DAG that I was telling you about earlier.

You know, they're like, oh, I'm already deriving this signal, I need to do something additional on top of it. So they can either update it or add a new job and make it part of the pipeline. So, you know, the platform team is mainly responsible for making sure we have a stable, secure platform, and then people can build on top.

Madhukar: And so if you're doing 100 terabytes a day, do you also have, at any given point of time, 8, 10, 100 petabytes of data? And I'm assuming you continue to [00:23:00] archive stuff as well, or, you know, sunset it, whatever you call it. So, what's the total size?

Premal: Yeah, we have probably 10 to 15 petabytes of data sitting in, you know, storage like S3 and HDFS combined.

Madhukar: So, if you were to categorize that data, like the 10, 12 petabytes of data, would you say most of it is structured or unstructured? Fast-moving or slow-moving? Do you think about data in that way? Or do you think of it more from an application perspective: I don't really care, my application's need is to get this data and I need it in a few milliseconds, or whatever?

Premal: Yeah, so I would say it's mostly structured data. We convert it to structured. Once we get it in different forms from our partners and customers, you know, everything is a Hive table. So then you should be able to query it, you [00:24:00] should be able to join it to anything else, and derive some signals on top.

Madhukar: Wow. So most of the data is structured, and you said it's in Hive tables. So, I'm assuming, do you still use MySQL, or?

Premal: Oh, we absolutely use MySQL. As I said, what we call our small data ecosystem is MySQL, right? So all our customer metadata is sitting in there, all the job metadata is sitting in there. We still have a decently sized MySQL to run the, you know, the metadata layer. But the big data is sitting in Hive.

Madhukar: So the majority of the data is in Hive, some in MySQL, and I know you have SingleStore as well. Where does that fit in? Like, how do you think of the data architecture today versus in the future?

And why have multiple databases, so to speak? Hive, I kind of get it. MySQL, I also kind of get it: more [00:25:00] transactional use cases, like more writes. And then, of course, SingleStore. I'd love to hear more about how you're using it. But tell me a little bit about the data architecture landscape and how you think about it from the end user's perspective.

Premal: One way to consume data from 6sense is to go to our UI. Whether you're a marketer, salesperson, or in operations, there are different UIs you can get into and do different things, like create campaigns, create a segment of accounts or a segment of people, or get some details about an account or all the accounts that you own, right?

Madhukar: And that's like a join statement from a bunch of different tables under the hood?

Premal: Exactly right. Now, we want to make that experience really fast for someone to come into the UI and get the data out. So traditionally, we have, you know, Presto/Trino clusters running that can do fast querying [00:26:00] not only of data sitting in Hive itself, or in S3 along with HDFS, but can also maybe do a join to MySQL at the same time and return that output to our customer. And there are certain UIs where we want to be even faster, where there's not a scope for a lot of exploration.

It is precanned in many ways. So we put HBase there. You have to do a lot of data engineering to say, okay, this is how the customer is going to access it, this is how our API is going to access it, so let us store the data this way, so it is as fast as possible. So we had those two ecosystems where, you know, if you had exploratory data, we would put you on the Presto layer, and if it was more canned, the HBase layer. But you could see that there was data fragmentation. The same data was sitting in two different places. And if you update it in one place, you have to make sure you update it in the second place.
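The fragmentation being described here, one copy of the data shaped for ad-hoc queries and a second copy reshaped ahead of time for a single precanned access pattern, can be illustrated with a toy sketch. The account names and scores are invented:

```python
# Two serving paths over the same logical data, as in the Presto/HBase
# split described above. All data here is made up for illustration.

rows = [
    {"account": "acme",   "score": 71},
    {"account": "globex", "score": 88},
]

# Exploratory path (Presto-style): arbitrary predicates evaluated at
# query time against the source rows.
def explore(rows, min_score):
    return [r["account"] for r in rows if r["score"] >= min_score]

# Precanned path (HBase-style): the data reshaped ahead of time for one
# access pattern, a direct key lookup.
canned = {r["account"]: r["score"] for r in rows}

# Both copies agree today...
assert explore(rows, 80) == ["globex"]
assert canned["globex"] == 88

# ...but an update applied to only one copy silently diverges. This is
# the fragmentation that motivated consolidating the serving layer.
rows[1]["score"] = 60
assert explore(rows, 80) == []   # exploratory copy sees the change
assert canned["globex"] == 88    # canned copy is now stale
```

Keeping the two copies consistent requires dual writes or a sync job, which is exactly the operational burden a single serving store removes.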

So that's where, you know, we brought in [00:27:00] SingleStore, to say, let's just move all of our UIs to one data lake, and everybody gets a consistent view. There's only one copy. So we process everything in our Hive and Spark ecosystem, and then we take the, you know, subset of the processed data, move it to SingleStore, and that's the customer's access point.

Madhukar: I see. And when you say move the data, is it some sort of an ETL job that's running outside the databases, pulling the data, doing something with it, and then sending it over to SingleStore? Or is it like a pipeline inside SingleStore that is pulling the data and doing some transformations on the fly?

Premal: Our Hive and Spark systems are processing data, combining data sets, and deriving signals. And once it's put into a final table in the Hive ecosystem, we ETL it into SingleStore using pipelines or LOAD DATA, whatever is best for that particular [00:28:00] source. So we have, again, hundreds of tables in SingleStore that we are pulling data into from different tables in Hive.
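6sense's actual pipeline definitions aren't shown in the episode, but the Hive-to-SingleStore hop Premal describes can be sketched as generated CREATE PIPELINE statements over Parquet exports in S3. All bucket, table, and column names below are invented, and the exact pipeline syntax (CONFIG, CREDENTIALS, FORMAT clauses) should be checked against SingleStore's documentation before use:

```python
# Hedged sketch: one CREATE PIPELINE statement per exported Hive table.
# Every name here is hypothetical; verify syntax against SingleStore docs.

def pipeline_sql(table, s3_prefix, columns):
    """Build a CREATE PIPELINE statement for a Parquet export of one table."""
    mapping = ", ".join(f"{c} <- {c}" for c in columns)
    return (
        f"CREATE PIPELINE load_{table} AS\n"
        f"LOAD DATA S3 '{s3_prefix}/{table}'\n"
        f"CONFIG '{{\"region\": \"us-east-1\"}}'\n"
        f"INTO TABLE {table}\n"
        f"FORMAT PARQUET ({mapping});"
    )

# Hypothetical exported Hive tables and their columns.
tables = {
    "account_scores": ["account_id", "score", "updated_at"],
    "intent_signals": ["account_id", "keyword", "strength"],
}
statements = [pipeline_sql(t, "example-bucket/exports", cols)
              for t, cols in tables.items()]

assert statements[0].startswith("CREATE PIPELINE load_account_scores")
assert "FORMAT PARQUET" in statements[1]
```

The appeal of the pipeline approach is that ingestion runs continuously inside the database, rather than as an external ETL job that has to be scheduled and monitored separately.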

Madhukar: And what was the reason for SingleStore, versus doing all of this in Hive and MySQL, or with Presto as the front layer to MySQL and Hive?

Premal: Presto is great for doing exploration, but if you want sub-hundred-millisecond response times for complex queries, it definitely does not perform as well. There are concurrency issues, and you need a lot more hardware to get those really fast response times. MySQL cannot store the amount of data that we output; that's why we had the Hive layer to store tables. So if we were to build those really fast UIs, you know, like we did with HBase, doing it with Presto just did not work for that use case. That's why we had HBase going in, and that's why we wanted to get something else that would work.

Madhukar: So, actually, I have two [00:29:00] questions. One I'm very curious about because of my own personal experience, so let's go with that. If I'm building an application today from scratch, right?

I choose MySQL, or I choose Postgres. For this example, let's say I use MySQL. At what point do I start to think about how much data it should be storing? What's the point where I need to think, okay, I need something else? What are some of those triggers that go through my mind to say I have outgrown MySQL?

Premal: One is if you are in a multi-tenant environment, right, and you are pulling a lot of data from your customers.

Madhukar: So I have multiple customers accessing their parts of the data.

Premal: Yeah, exactly. And you would have customers with different sizes of data. And you can't really provision the right capacity all the time, because you don't know how customer data is going to change, how it's going to grow.

Especially when you want to [00:30:00] store a lot of historical data, and you want the customers to do historical analysis on top of the data, or even internally you want to do it, right? MySQL is great for doing a few million, a hundred million writes a day. But if you want to do billions of rows of insertion and be able to query it really fast, you're going to run into trouble.

And as you said earlier, you'll do partitioning and sharding and this and that, and it's a lot of overhead. You know, like the company you were mentioning that you worked at, Facebook had hired MySQL gurus to, you know, build out a very complex MySQL environment and make source code changes to MySQL, right?

So, you know, it's very evident that you are not going to be able to scale your business if you're storing the raw data in MySQL. 
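The "partitioning and sharding" overhead being discussed usually starts as application-level routing of each tenant to one of N MySQL shards. A minimal sketch; the shard names are invented, and a real setup also needs resharding, replicas, and cross-shard queries, which is exactly the burden that pushes teams toward a distributed database:

```python
import hashlib

# Hypothetical shard inventory; in practice these would be connection
# strings or host names for separate MySQL instances.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(tenant_id: str) -> str:
    """Deterministically map a tenant to a shard by hashing its key."""
    digest = hashlib.md5(tenant_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The mapping is stable: the same tenant always lands on the same shard,
# so every read and write for that tenant goes to one place.
assert shard_for("acme") == shard_for("acme")
assert all(shard_for(t) in SHARDS for t in ["acme", "globex", "initech"])
```

Note that with simple modulo hashing, changing the number of shards remaps almost every tenant, which is why growing clusters either live with painful resharding or adopt consistent hashing.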

Madhukar: So, second part of the question: let's say, knowing everything you know today, if you were to build 6sense today, how would you choose your data architecture?

Premal: I would [00:31:00] not change a lot, right? Our Hive and Spark ecosystem is pretty strong. We are very happy with it. Obviously, we are upgrading it to the latest versions. Hive and Spark have their own strengths and weaknesses; you know, we love both of them, and we have experience with them. They are great for doing the big data manipulation and processing.

We even do machine learning through it. So it's very versatile. We have written a lot of custom functions to use during the processing, right? You know, like a function to, say, normalize a string; that's a very simple thing that we've written in Hive. But we've also done very complex things that are possible in that ecosystem.

MySQL still stays, because we still need to store the metadata, keep it fast-moving, and do those small inserts. There's also a lot evolving in the Hive ecosystem in terms of file formats, so we'll probably go to some delta file format like Hudi, or there's another one that is skipping my mind, where we can actually make little changes to files after we write them, which is not available in the traditional Hive, Parquet, or ORC system. So it's definitely going to be a big data system. We do run on a cloud, but whether we run it directly on EC2, in a Kubernetes environment, or on EMR is something we would again look at, to figure out what is the most cost-effective and lowest-maintenance for us.

Then I would still figure out what is the best database for a customer to interact with the solution, and how much interaction they need.

Madhukar: which one, in your experience, works for most of the requirements? 

Premal: Yeah, so, you know, we have extensively tested SingleStore for a lot of our use cases. And when we were doing the research and we were in the [00:33:00] POC phase, we were working with two, maybe three, other vendors.

And, you know, everybody had their own strengths and weaknesses. SingleStore stood out because of its HTAP capabilities: we can do a MySQL workload within SingleStore and not need a separate MySQL. We are not there yet, personally, because it takes time to get there. But we definitely found it great for running large-scale analytical queries on top of the data and still returning results in a transactional time period, right? We can do very complex things, we can architect the data the right way, and we can achieve the right results, which are not possible with a lot of the traditional systems.

Madhukar: So, kind of a newbie question, because I'm not very familiar with Spark. Let me see if I get it: at a very abstract level, Hive is like a SQL table, so I can run SQL queries, right? But under the hood, it's like HDFS.

Premal: Yeah, it's like a file system. It can be a CSV [00:34:00] file, whatever, and you can still treat it like a table.

Madhukar: That's beautiful. So, I have a very large set of data sitting in something file-system-ish, and I'm running SQL off of that, which is what Hive is. Spark, can you explain it to me? What does it do? What do you use it for? Is it like a cron job, but in Java, or is it more like an end-to-end application that pulls the data, does something, and then deposits it somewhere else?

Premal: There are a lot of similarities between Spark and Hype. And they actually work off the same metadata, right? So when you create a Hive table, there is a Hive metastore which tells it what are the columns and what are the data types, etc.

So Hive is kind of an execution engine on top of your files, and Spark is a complementary execution engine. Now you can, you know, write Spark SQL just like you write Hive SQL, and they are pretty much compatible with each other, with minor changes. Or you can actually [00:35:00] write Python or Java code in Spark, treat your files as data frames, and do similar operations.

So with Spark, when you write code, you can now write unit tests and other stuff around it that makes it easier to understand, maintain, improve, and test; or you can use Spark SQL. So there are options with Spark, and there are multiple languages you can write code in.

But basically it does the same thing as Hive, in some ways: it runs execution on top of one or multiple tables and creates an output on top of it.

Madhukar: Got it. So what I really like about what you just said is that I can use Python and, similar to Pandas, create a data frame, and that's where I use Spark. And the syntax for getting that data is through SQL, right? Which comes through the Spark framework itself, or I could choose Hive.

Premal: Yeah, I mean, if you write Python code, you can just say, hey, here's a, you know, Hive table, load it in a data frame. Here's another Hive table, load it in a data frame, join them based on this column, do this kind of aggregate, do this kind of operation, and write it back to this new table.

And you can write that in Python. 
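The load, join, aggregate, and write-back flow Premal just walked through is a few lines in the PySpark DataFrame API. Since that needs a running Spark session, here is the same logic sketched in plain Python; the table names and columns (`accounts`, `visits`, `account_id`, `pages`) are hypothetical stand-ins.

```python
# Two "tables" as lists of rows, standing in for Hive tables
# loaded into data frames.
accounts = [
    {"account_id": 1, "name": "acme"},
    {"account_id": 2, "name": "globex"},
]
visits = [
    {"account_id": 1, "pages": 5},
    {"account_id": 1, "pages": 3},
    {"account_id": 2, "pages": 7},
]

# Join on account_id, then aggregate pages per account -- roughly what
# accounts_df.join(visits_df, "account_id").groupBy("name").sum("pages")
# would express in the Spark DataFrame API.
by_id = {a["account_id"]: a["name"] for a in accounts}
totals = {}
for v in visits:
    name = by_id[v["account_id"]]
    totals[name] = totals.get(name, 0) + v["pages"]

# "Write it back to this new table": here, just the result rows.
result = [{"name": n, "total_pages": p} for n, p in sorted(totals.items())]
```

In Spark the same code scales to billions of rows because the join and aggregation are distributed across the cluster rather than run in one loop.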

Madhukar: Okay, makes sense. Let's talk a little bit about the deployment side, right? Because I think what you said about 100 million containers blew my mind; I'm still thinking about it. How do you do deployments, then? Or how do you do releases? Do you get a feature from a certain team, and it's in production as soon as somebody does a PR and it gets committed and approved, and then your customers have that feature? Or do you say, now I'm going to have another replication of my entire environment somewhere else, and then I have questions about those six or ten [00:37:00] petabytes of data, and then you accumulate all your releases and, in a standard way, move them into production? Like, how do you think about deployment at such a massive scale?

Premal: Yeah, so, developers start off with their own local development, right, and they create feature branches. Then, once a bunch of developers on the same team have things ready to test in a dev environment, they create a special branch, which builds containers for their services automatically.

And then they're deployed to that environment, where, you know, a bunch of testing happens and QA spends its time. And then, once a month, it's release time, right? At that point we, you know, cut a release branch, code gets merged into it, and it's tested again. Automation and all that stuff is happening.

And then it gets merged into the trunk, containers get built, and now [00:38:00] developers do the deployments. There are also cases where auto-deploys are happening for our pipelines and such. But, you know, generally we are just deploying on top of Kubernetes, and rollbacks and all those things are available.

But we generally coordinate our releases around a particular time of the month, especially for big features. Things go behind feature flags, so not every customer gets them immediately. You know, some things go to beta, some things go direct to production. So there are different phases for different features.
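The feature-flag gating Premal mentions can be sketched as a simple lookup: each feature maps to a rollout stage, and a customer sees it only if their cohort is at least as early-access as that stage. All the flag names, stages, and customer cohorts here are hypothetical.

```python
# Rollout stages, ordered from narrowest to widest audience.
STAGES = ("internal", "beta", "ga")

# Hypothetical flag table: feature -> current rollout stage.
flags = {"new_dashboard": "beta", "genai_summary": "internal"}

# Which cohort each customer belongs to; default is "ga".
customer_cohort = {"acme": "beta", "globex": "ga"}

def is_enabled(feature: str, customer: str) -> bool:
    """A customer sees a feature if their cohort is at least as
    early-access as the feature's rollout stage."""
    stage = flags.get(feature)
    if stage is None:
        return False  # unknown flags are off by default
    cohort = customer_cohort.get(customer, "ga")
    # A beta customer gets beta and ga features; a ga customer only
    # gets features that have reached general availability.
    return STAGES.index(stage) >= STAGES.index(cohort)
```

Real flag systems (LaunchDarkly, homegrown tables, and so on) add percentage rollouts and kill switches, but the check at every call site looks much like `is_enabled`.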

Madhukar: So when I think about, you know, millions,

Premal: Yeah, and sorry, you asked about data. So we don't have everybody's data in the dev environment; a lot of it is also creating synthetic data and testing with different scenarios. And then we have our own data, so we can use that as a guinea pig. And then we have test environments that we have set up, so we can simulate as much as possible.

For the different integrations: somebody has Salesforce, somebody has Marketo, Eloqua, HubSpot; all those environments can be tested.

Madhukar: [00:39:00] So when I think about millions of containers, to me it's almost like a beehive, where you are standing 100 feet away and looking at the whole thing. Do you have something like that, where you say, okay, these are all my containers, there's an issue here, or something seems to be going off over there, let's go look at it? Is it like that, or is it a very different way of finding where the issues are, assigning them to the right engineer, and getting them fixed?

Premal: Yeah, so observability is the key, right? If you cannot see what's happening, you cannot fix it, you cannot understand what's going on.

We do have multiple Kubernetes clusters, but the web service cluster is not millions of containers; it's like a few hundred. Each service has multiple replicas, and, you know, we use a bunch of tools like Datadog and other logging services, so we have visibility, and we have a lot of alerts and triggers set up for things that are happening.

[00:40:00] Pages go off, and we should be able to trace back exactly to where the problem is happening and who is responsible for it, page the right team, with the right escalation paths. We should be able to see the metrics, whether it's a container-level issue, a database issue, a code issue, whatever it is, and trace it back to, you know, the line of code.

Madhukar: So, switching gears a little bit. Now I'm thinking a little bit like a marketer or go-to-market person. We talk a lot about, hey, a lot of tech startups are technology companies, and they end up selling to other technology companies like yours. How does the decision-making happen around what you end up buying, right?

So, for example, I remember when I was doing this stuff. By the time I'd left, it was about, okay, let me just go give it a try on my own. So I go look for a free trial, I try it, and if it works, I quickly build my own prototype, and the [00:41:00] next sprint I go show it to my manager or the product manager, and then we say, okay, now we think it's the right solution, let's go talk to the sales team, or something like that. Is that sort of how it works in large engineering teams, or is the buying process, or evaluation, completely different? Like, how does that machine work?

Premal: Yeah, it's a great question, and I can speak to how we do stuff at 6sense. With the advent of, you know, all the privacy regulations and compliance and the myriad of things that have propped up in the last several years, it is very important for our legal and security teams to understand what data is being sent where, especially when you're using a SaaS platform, right?

Are you sending any customer data? Any PII, or even customer names, [00:42:00] IDs, or email addresses of your users? Anything you are sending to a vendor comes under heavy scrutiny. So we always have a gate, even for doing POCs, to first, you know, get the right people involved to understand how we're going to use the product in terms of the safety and privacy of the data.

And then it is about the scale and the cost, and there are obviously build-versus-buy conversations. Is this going to scale with the number of customers, with the number of engineers, with the number of EC2 instances or Kubernetes clusters? You know, there are various different pricing models.

Right? Everybody has their own. So all those go into consideration. Obviously, we talk to vendors, we understand, we know our pain points, we try to find two or three that we can do POCs with, make sure legal and privacy are aligned, [00:43:00] then we do the POCs, and then the contracting process starts.

Madhukar: So, is it you and the leadership team that decides, okay, this is the direction we are headed, now let's go and do the evaluation? Or is it more ground up, from users and developers, where they try it out? Or where does that come from?

Premal: Yeah, it is mostly a strategic direction for the, you know, engineering team, saying, hey, these are the problems we want to solve, let's go and find some solutions, and some people are given the charter to go and find them. Sometimes developers will come with great ideas and say, hey, you know, I use this personal tool, but I think it would be great if everybody else could use it. So then we go into that evaluation too.

Madhukar: And so, from your perspective, what's the ideal way of making a purchase, or making a decision on a purchase? Do you think, oh, I want to go talk to somebody and really have a deep-dive conversation and a [00:44:00] POC for a month? Or is that changing? Has it not changed?

Premal: I think it will depend on the product that we are buying. Let's say I'm buying a database that one of us has experience with, and we know that it is the right use case for us; then we would try to short-circuit that process as much as possible, and we know that these are the only two vendors we need to talk to.

If we are trying to go into a new market, where we are saying, hey, let's find a new-age, let's say, logging vendor or observability vendor, then we have to really go and do deep dives with their sales and SE teams to understand why they are different, how they will help us solve the problem, whether they will actually save us money or it's going to go the other way around, etc.

Obviously, as time evolves, people have experience with different things, so we would just ask: how can I try that as fast as possible? Is it available via the AWS Marketplace, so I don't have to talk to [00:45:00] anybody? That's the most ideal scenario. But if not, then, you know, yes, running a sales cycle is fine.

Madhukar: So let's talk about what everybody's been talking about for the last year now; it's the last section of this topic. Gen AI: how's that changing both the development and deployment process? I'm sure, or maybe I'm not sure, but have a lot of people started using GitHub Copilot, or PyCharm if you're using that IDE, to generate code? How is that changing how fast you ship and the quality of the code, and where do you see it going?

Premal: We are definitely in the evaluation process right now. A couple of engineering leaders are taking up that mantle, and we want to improve developer productivity and, you know, figure out the right way to ship and increase our velocity. So we're also looking at vendors [00:46:00] across the whole SDLC cycle.

I mean, I think GitHub Copilot is a great example. We are already using GitHub, so we can just add the licenses. So we are testing that out to see.

Madhukar: How do you find the quality of the code currently coming out of GitHub Copilot?

Premal: I cannot speak to it much right now, because I'm not that involved; somebody else is.

But I've heard good things in general from people who have tried it out, and that's why we are investing our time. I think this year is going to be when we actually get it deployed as fast as possible and improve the time it takes to ship something.

Madhukar: Do you think it will change the process or the way you ship stuff in any way?

Premal: I think it will definitely change, you know, the boilerplate stuff. It will help people write tests better, things that they don't think about. Maybe the code can give you a better [00:47:00] idea of what else you could be testing and what the edge cases are, and hopefully we can also inform the AI about how we would like to write code, and it can help the developer adhere to a particular standard.

And make sure that now code can be readable across the whole company and not just by the three people who wrote it. 

Madhukar: And from your end customers' perspective, do you see a lot of asks or requests about Gen AI capabilities? Or do you see a change in the pattern of how they currently use a product like 6sense, or even how they do their business?

Yeah, 

Premal: so, For what is important for our customers, the Sixth Sense gives them the right insight and gives them the insight very quickly. So we are, you know, we have a lot of different products where people come in and like they infer the data from what we're showing. Now it is. Our [00:48:00] response is to help them do that faster.

So now we are bringing in GenAI to give them the right summary, to help them ask questions of the data right from within the product, without having to, you know, think about it more, or open a support ticket, or ask their CSM. We want them to interact with the AI and say, hey, I'm looking at this company, or at this segment of companies, and this is what I see; what is the meaning of this, what can I do next, what is the right channel I should activate based on what you are showing me?

So, customers are, you know, looking forward to ways we can make their lives easier.

Madhukar: So, from your perspective, as somebody who has a very vested interest in the data itself, because that's what powers the company and the product, how do you think about the value of data? For you, what is more valuable?

Is it data that [00:49:00] was just born right now, or data that is a year old? So, the recency of data. The second could be the kind of data it is. The third is how accessible it is, accessible as in, you know, can you access it through SQL, JSON, and so on. Like, how do you think about those factors changing the value of data?

First of all, how do you think about what the value of data is? And how do you see that changing with GenAI, if at all?

Premal: Yeah, so we put a lot of value on historical data, because that's what the machine needs to know to understand the trends that have happened in the past, so that it can inform the future and help you do better things in the future.

So, for example, your past historical website visitation data: what are people doing on your website, right? What are they clicking on? What are the high-value pages? Does that lead to converting into an [00:50:00] opportunity? When you open opportunities, what are you doing with them, the opportunity history?

So when we are onboarding customers, we are trying to get as much historical data from them as possible, and obviously the third-party vendors are sending us historical data too, so we can immediately help the customer with the value of the historical data we have collected. And then, as new data is coming in in real time, as people are visiting their website, we can immediately tell them who it is and how they can engage with that particular visitor. So all data is important, right?

We have static data, like your CRM accounts and leads and contacts, but also all your intent data that is coming in: advertising data, third-party intent data from big and small partners. So it's a big, you know, data lake. When I heard that word the first time... well, yeah, we do have a data lake, and we [00:51:00] now have to manage it to make sure it does not spill over.

Madhukar: I mean, it's very interesting, because I've been thinking about something similar to what you just said. So, just to paraphrase, at least the way I understand it, what you're saying is that the richness and the volume of data are primarily important. If you just had very scanty data about somebody, then that's not very useful.

So you want to have the richness and the volume, and then on top of that, you want the recency of the data: what did that person just do, which gives me the intent. Is that it?

Premal: Yeah, exactly right. If you're a very new company, you don't have enough opportunities, you don't have enough accounts to go after; then, you know, you would not be able to use certain parts of the product.

The machine learning piece would not be the best thing for you; you should do something else. But if you have, you know, a lot of historical data, then we can derive better insights. We also have, [00:52:00] you know, ways to fill in the blanks and infer things based on what we have seen across our customer base, to help someone who does not have enough, though that will only go so far.

Madhukar: When you think about generative AI or large language models, are you thinking about using them within your own engineering process as well, around your data? Do you ever think, oh, because we have 10 petabytes of data, we should eventually have our own large language model? Or do you think, no, I'm going to use an existing one from OpenAI or somewhere, and then embellish that data and use it for something different? Have you started thinking about that in any way?

Premal: Yeah, it's more of the latter. You know, our data science team does a lot of experiments with external systems like OpenAI, Llama, etc. And now there are so many options available in the market to run [00:53:00] LLMs in the cloud, and they will host multiple models for you. So we just don't want to take on that time and effort and cost; instead we figure out what to do with RAG, you know, helping improve the quality of the output. We are just doing a lot of experiments to see what is the best tool out there to solve a problem that we have.

Madhukar: And do you, personally or from your team, have a verdict on which model is the best, between the open-source as well as the commercial ones?

Premal: I think we are still seeing OpenAI's models be the best for most of our use cases.

Madhukar: That's what I keep hearing from others as well. And it's not just the quality of the results that come back; I think it's the whole product: the quality of the product, the features they add, and how quickly they add them. That's pretty interesting. But you mentioned you looked at Llama 2, like the 70-billion model, as well?

Premal: I think so, yeah. The team experimented, and then, you know, we were figuring out, okay, how do we deploy it, and how do we make sure it runs at scale. [00:54:00]

And they're like, okay, OpenAI is giving similar or better results, and the cost might actually be the same or even less; let's just go with that. And, you know, if things get out of hand, we'll keep doing things in the background to understand what else we can use. We'll always have a backup ready.

Right? OpenAI, thankfully, did not implode with all that craziness that happened a few months ago. But that's something we have to be ready for, right? If their pricing model changes or the quality of the output changes, we should always have something else that can step into place.

Madhukar: Yeah, I mean, that's where I feel the world is headed.

It's pretty interesting in terms of, you know, how we went from monolithic to microservices. I see a lot of this moving towards being agent-oriented, with an agent being: you have access to one large language [00:55:00] model, you have your own knowledge or your own data source, and then you have your own tools.

So then you have one agent that is specialized for doing something, because, just like you were saying earlier, it has its own depth of knowledge, but also the recency of the data that is coming in. So I'm particularly interested to see where that goes. Since you mentioned RAG, this will probably be the last question.

How do you see that? Do you have plans to use vectors and semantic search, and do you have plans to evaluate how to change or evolve your architecture to accommodate, or, you know, be ready for, RAG?

Premal: So, one of the things that we are like we already have We create a Slack bot for our support team.

So our, when a customer asks a question support team will add the, ask the bot the question, which will do nice rag search on like the [00:56:00] content that they have created. Yeah. And give them the answers along with few links, and we give it to the customer. So that's just the internal testing. But eventually the phase is the customer actually access, has access to it, right?

, all our internal teams, scs CSMs. But now even the customer can ask questions instead of opening support tickets, right? And that comes from all the knowledge base that not only we add to you know, what we call the RIO CT, which is our knowledge base, but also every other customer who is adding knowledge to it.

So it's coming from our own data set, but it's answering the question the right way. 
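The retrieval step of a RAG bot like the one Premal describes reduces to: embed the question, find the nearest stored knowledge-base chunks, and hand those to the LLM as context. Here is a minimal sketch with toy three-dimensional embeddings; a real system would use an embedding model and a vector-capable store, and the chunk titles and vectors below are entirely made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical knowledge-base chunks with precomputed embeddings.
kb = [
    ("How to connect Salesforce", [0.9, 0.1, 0.0]),
    ("Billing and invoices",      [0.0, 0.8, 0.2]),
    ("Segment setup guide",       [0.1, 0.2, 0.9]),
]

def retrieve(question_vec, k=2):
    """Return the k chunk titles most similar to the question
    embedding; these would be prepended to the LLM prompt as context."""
    ranked = sorted(kb, key=lambda item: cosine(question_vec, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

# A question whose embedding is close to the Salesforce chunk:
context = retrieve([0.8, 0.2, 0.1])
```

The brute-force `sorted` scan is fine for a few thousand chunks; the ANN indexes mentioned below exist precisely to avoid this linear scan at larger scale.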

Madhukar: So you are using vectors for that, I'm assuming? Where are you storing those vectors? How are you creating them?

Premal: Yeah, so thankfully we have SingleStore, so we don't have to procure anything else, right? You know, we started experimenting, and I don't think we have gone to the latest 8.5 with ANN yet, but we are using 8.1 somewhere, getting the right results, vectorizing, and [00:57:00] then sending it to OpenAI.

Madhukar: Just to be clear, I was not aware that SingleStore was it, but I was hoping you knew that it has vectors, and you can mix and match data right in one SQL statement.

And the other nice piece is it has pipelines that can bring in your existing incidents and issues, vectorize them, and put them into the knowledge base as well. So when somebody's asking, you have the freshest data; if there's an incident going on, you're able to tell the customer that this is currently going on, and you can answer questions related to that.

Premal: We will look into that.

Madhukar: Awesome. Well, thank you so much, Premal, I've learned so much. A couple of quick questions, and then we can end the show. What's your favorite code editor? Is it VS Code? Is it Vim?

Premal: Oh, I'm on the JetBrains Java IDE, which has a Python plugin, so I can do everything.

Madhukar: PyCharm and [00:58:00] WebStorm, the whole JetBrains suite.

Premal: Yeah. Luckily the Java editor now has the Python plugin, so I don't need to open both.

Madhukar: Oh, nice.

Premal: That's been around for maybe a couple of years. So I do all my editing there: Java code, Python code, SQL.

Madhukar: And JetBrains, I saw that they also added an AI feature, very similar to Copilot.

Premal: Yeah, I have not used that yet. You know, fortunately, I don't have to do a lot of coding these days; I'm mostly just reviewing code and looking at stuff. It's called IntelliJ, the name came to me.

Madhukar: I see. Okay. Second question: if you were to tell somebody in school or college what to focus on when coming into an engineering job today, or even two years from now, what are the three things you would tell that kid to know, to learn, and to be really good at?

Premal: I would say being kind of a full-stack person is pretty important these days. You should be able to [00:59:00] understand the concepts of data and storage, you know, at least the basics: have a backing database, build an application on top of it, be able to write some backend APIs and backend code, and then build a decent-looking UI on top of it. That actually gives you an idea of what is involved end to end in building an application.

This is, again, just a web application; obviously you can do mobile and this and that. But, you know, versus being focused on 'I only do X,' you need the versatility. A lot of employers are looking for that, and it also helps you personally: you can be independent, you can be the person that people go to and say, hey, I need you to build the full cycle, you are the person behind it. And I think these days it's probably important to understand, you know, what's happening in the AI world, and at least be able to experiment and play around with it a little bit. I'm sure a [01:00:00] lot of tools are available for people to try out, to write code, to make shopping lists, or to do planning and itineraries.

So, being familiar with what's happening in the world with the new-age technologies.

Madhukar: Very good. And then the last question: what do you like to do most outside of work?

Premal: A lot of decompression: hanging out with the kids, watching some, you know, TV shows, watching sports, sometimes going to hit up the go-karting track.

Madhukar: Very good. Well, thank you so much, Premal. I think this was phenomenal; I've learned so much. I'm going to go back and take some notes, because there were some really good nuggets in there. Really appreciate it. And like I said, genuinely, I love the product and what you folks are doing, so I'm looking forward and rooting for you.

Premal: Thank you, appreciate that. And we love your product too. [01:01:00]

Madhukar: All right.