WEBVTT 00:00.000 --> 00:11.760 All right, our next speaker is Evan Rusackas, who's going to present how Apache Superset 00:11.760 --> 00:14.760 reinvented and re-engineered its whole documentation. 00:14.760 --> 00:16.520 Please give a warm welcome to Evan. 00:16.520 --> 00:22.040 Thanks for joining. 00:22.040 --> 00:26.840 Just curious if anybody has not heard of Apache Superset here, I'd love to see a hand. 00:26.840 --> 00:27.840 All right, fantastic. 00:27.840 --> 00:29.360 You've justified my existence. 00:29.720 --> 00:35.360 All right, this is a long title, but I wanted to share some of the needs and projects 00:35.360 --> 00:39.080 and learnings that have led to a better documentation setup for our project. 00:39.080 --> 00:42.680 I'm Evan Rusackas, I work at a company called Preset. 00:42.680 --> 00:49.920 It's like a managed service version of Apache Superset, and I'm a PMC member, which is a project management 00:49.920 --> 00:55.520 committee person, and I work on the docs, and I'm tired of doing it, and I read them 00:55.520 --> 00:58.880 sometimes, but mostly lean on AI, like everybody else these days. 00:58.880 --> 01:03.360 So I'm trying to make things easier for those training models on our behalf. 01:03.360 --> 01:08.840 There's my contact info if you want to follow up afterward, but very quickly, this is not 01:08.840 --> 01:13.120 a product talk, but I'll just tell you what Superset is real quick. 01:13.120 --> 01:18.560 It's a very advanced BI tool for data teams that want to just democratize access to their 01:18.560 --> 01:23.960 data and visualize it in sensible ways and explore and share the insights they find. 01:24.040 --> 01:29.320 By GitHub stars, it's the Apache Software Foundation's biggest project, and it's got a lot of contributors. 01:29.320 --> 01:34.920 There's been about 350 PRs this month, it's a very active project. 
01:34.920 --> 01:38.600 And it's got a SQL workbench, which allows you to really connect to pretty much any data 01:38.600 --> 01:42.920 source under the sun, and write all kinds of queries, share them with your team, then you 01:42.920 --> 01:48.040 can drag and drop those columns and build visualizations, and with those visualizations build dashboards, 01:48.040 --> 01:51.920 and do all kinds of drilling and filtering and cool stuff with them. 01:52.720 --> 01:58.400 Then, of course, Preset, who was kind enough to fly me here, is a managed version of that 01:58.400 --> 02:03.520 that adds a bunch of bells and whistles, and you can have multiple instances of Superset 02:03.520 --> 02:08.880 for all sorts of different purposes. If you want to try Superset, this is an easy way to do it 02:08.880 --> 02:13.360 for free. I'm not here to sell stuff, it's an open source conference. So let's get back to talking about the 02:13.360 --> 02:19.520 docs. You can check out the repo here, you can check out the docs themselves, it's all live on the site, 02:19.680 --> 02:26.080 obviously. And when we talk about rebuilding our documentation, this is one of the first questions 02:26.080 --> 02:32.800 that always comes up. Can't you just have AI do this? And we sure tried, just for sport, but in the end, 02:32.800 --> 02:40.720 it did not go very well. You have AI that just kind of dumps out a whole bunch of garbage. Really, 02:40.720 --> 02:46.080 you get all these fancy Mermaid diagrams, but all the unimportant details kind of come 02:46.160 --> 02:50.320 front and center, and all the little nuanced things that are very important to humans and administrators 02:50.320 --> 02:57.440 of this kind of product just get ditched, you know, somewhere very deep down in the docs, or 02:57.440 --> 03:02.080 they're hard to find. Sure, there's a lot of pages, a lot of words, but it doesn't really help people 03:02.080 --> 03:07.040 that much. 
So you're the right one, you know where the product is headed, you know what's important 03:07.040 --> 03:11.360 to people that use and administer your product. So you should have a lot of say in how these things 03:11.360 --> 03:16.640 are built. So the hot take here is that, yeah, you're the one that knows what your docs should be, 03:16.640 --> 03:22.560 and what they should say, and machines are really the ones that are good at writing code. And the 03:22.560 --> 03:28.480 point of this talk is that you can have AI write code to write your docs. So I wanted to share some 03:28.480 --> 03:32.400 of the hackathon experiments that I've been working on to kind of prove the point to myself and the team. 03:34.000 --> 03:39.520 So it all started with a little bit of road mapping. We had a project that 03:40.480 --> 03:45.840 we had to reinvent the docs for, I'll get into it. But step one is assessing the mess. In our world, 03:45.840 --> 03:52.880 we had all kinds of stuff that was scattered all over. We had wikis, we had, you know, emails and third-party 03:52.880 --> 04:01.040 blogs, and just README files in dozens of places on the repo. Everybody kind of created their own 04:01.040 --> 04:06.880 little scattered bit of historical information somewhere, and institutional knowledge was really 04:07.360 --> 04:13.360 what people would lean on, far too much. So the idea was to kind of clean all this stuff up, 04:13.360 --> 04:20.240 get it all under one roof, and make it better than it's ever been. The key problems to solve are 04:20.240 --> 04:27.360 that, you know, if you can't find everything, the on-ramp for new contributors is incredibly 04:27.360 --> 04:34.160 difficult, or for new users. Search is limited. Even AI doesn't have that great of a singular 04:34.160 --> 04:39.360 knowledge base to refer to when it's doing training runs. 
There's a lot of duplication of effort 04:39.360 --> 04:44.080 because if there are multiple places things are being documented, guess what you've got to do. And then, 04:45.360 --> 04:49.760 the worst part as a contributor is whenever you write a new pull request and get some code merged, 04:49.760 --> 04:53.200 now you have to go write the docs for that thing, and that's a total drag. Nobody wants to do it. 04:53.920 --> 05:00.400 So this all just becomes a giant mess, where the code base is a moving target and the docs can't keep up. 05:00.480 --> 05:07.680 So what do we do? You let that code base determine what you should do, really. You get everything 05:07.680 --> 05:13.520 under one roof, federate all your content first, get everything cross-referenced, and then you try 05:13.520 --> 05:19.040 to get the docs to build themselves. Your code is probably full of little implementation details 05:19.040 --> 05:25.600 and metadata and all this stuff that's really useful. So we'll capitalize on that. And then, of course, 05:25.600 --> 05:31.280 you want to optimize for humans that actually read the docs, and then, of course, for the AI training 05:31.280 --> 05:36.480 models that people are building. You know, these foundational models are trained on open source, 05:36.480 --> 05:43.680 and that's a whole other topic. It's great. So I wrote up this big proposal. We call it a 05:43.680 --> 05:49.040 SIP, a Superset Improvement Proposal, and it was to build a new developer portal because we're building 05:49.040 --> 05:54.560 a whole new extension architecture in Superset, which is awesome. But for all these new features, 05:54.560 --> 05:59.200 we want to make sure that developers are using them, so we have to go make it easy to do. 05:59.200 --> 06:05.520 First up, find the right platform. 
Another good use of AI is doing your homework for you and finding 06:05.520 --> 06:09.840 all the platforms that exist and where they fall short for us. It turned out that actually 06:09.840 --> 06:14.480 Docusaurus ticked all the boxes when you add all the fancy plugins. So we rolled with that. 06:16.240 --> 06:22.640 And speaking of getting rolling, sweeping up. First thing you get to do: you've got old docs. 06:22.720 --> 06:28.160 So go ahead and let AI sift through and find all the spelling mistakes, add all the cross links. 06:28.880 --> 06:33.760 It does all the heavy lifting of pulling your wiki over and all that very easily. So you 06:33.760 --> 06:38.880 could just kind of get organized and have a good "here's all my stuff" version of the documentation. 06:40.480 --> 06:47.760 And then you've got to look for the fun part: the opportunities to make your documentation build 06:47.760 --> 06:54.480 itself. The code in many places is self-documented. So you want to look for those repeating patterns 06:54.480 --> 06:59.760 and probably rewrite parts of your code itself so that it can be leveraged by your documentation. 07:02.400 --> 07:08.800 AI is really great at turning metadata into pages, but not just by saying "here's some metadata, 07:08.800 --> 07:13.040 spit out a bunch of words", but by actually having it write the code for Docusaurus to render 07:13.040 --> 07:19.600 those pages, or any other documentation tool you're using. So let it write scripts, because we all 07:20.400 --> 07:24.880 are probably using AI to write code every day. I know I hardly write code at all anymore, to be 07:24.880 --> 07:36.480 honest. So just use it to do that. So first test: we have this mapping visualization 07:36.480 --> 07:41.920 in Superset, one of a few different map visualizations, where we have to have all of the 07:41.920 --> 07:47.600 countries of the world represented, and it parses a bunch of GeoJSON stuff. 
And if you go and 07:47.600 --> 07:53.200 mess with this gigantic Jupyter notebook, then you've got to go and update the actual plugin 07:53.200 --> 07:57.760 itself to add the country that you might have added, and then you've got to go update the docs. 07:58.320 --> 08:06.560 So it's easy enough to actually have the Jupyter notebook update the code for the visualization 08:06.640 --> 08:12.080 plugin and update the docs. And I was like, oh, that's cool. Now the contributor could just 08:12.080 --> 08:17.120 do one little thing in a notebook and the product and the documentation just take care of themselves. 08:18.320 --> 08:24.240 So let's expand on that. Feature flags are something that were a nuisance. 08:25.040 --> 08:29.840 Previously we had this Markdown file on the repo, and every time somebody added a feature flag or 08:29.840 --> 08:36.960 changed its status or its default, you have to go and update this thing, and nobody ever does or wants to. 08:38.000 --> 08:43.120 So it falls out of date all the time and led to funny bug reports and stuff. 08:44.000 --> 08:50.480 So I went in and I added a bunch of comments to the config file. So you've got this meaningful 08:50.480 --> 08:55.120 stuff about what category the flag falls into, what its default status is, what stage it's in 08:55.120 --> 09:00.160 in its feature flag life cycle as we get rid of things, and then it builds these pages. 09:00.720 --> 09:07.200 You've got a super long page, all very organized, of what is set which way, what it does, how long it's 09:07.200 --> 09:12.400 going to be there, and that's very handy, and you never have to touch that documentation file again. 09:13.280 --> 09:19.360 API docs. Everybody's got an API. Everybody's seen this thing, the Swagger renderer. 09:20.000 --> 09:27.520 We've had that in our product forever, never loved it. 
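Editor's note on the feature-flag pages described just above: the idea of annotating the config file with comments and generating the page from them could be sketched roughly as follows. This is a minimal illustration, not Superset's actual script. The `@category`, `@lifecycle`, and `@description` comment tags, the flag names, and the table layout are all assumptions made up for this example.

```python
import re

# Hypothetical config excerpt. The "@key: value" comment annotations are an
# assumed convention for this sketch, not Superset's real annotation format.
CONFIG_SOURCE = '''
FEATURE_FLAGS = {
    # @category: dashboards
    # @lifecycle: stable
    # @description: Example flag that is on by default
    "EXAMPLE_FLAG_A": True,
    # @category: sql
    # @lifecycle: testing
    # @description: Example flag that is off by default
    "EXAMPLE_FLAG_B": False,
}
'''

ANNOTATION = re.compile(r"^\s*#\s*@(\w+):\s*(.+?)\s*$")
FLAG = re.compile(r'^\s*"(\w+)":\s*(True|False),?\s*$')


def parse_flags(source: str) -> list[dict[str, str]]:
    """Collect each flag together with the annotations written above it."""
    flags: list[dict[str, str]] = []
    pending: dict[str, str] = {}
    for line in source.splitlines():
        if m := ANNOTATION.match(line):
            # Remember the annotation until the flag it belongs to shows up.
            pending[m.group(1)] = m.group(2)
        elif m := FLAG.match(line):
            flags.append({"name": m.group(1), "default": m.group(2), **pending})
            pending = {}
    return flags


def render_markdown(flags: list[dict[str, str]]) -> str:
    """Render a Markdown table, grouped by category, then by flag name."""
    rows = [
        "| Flag | Default | Category | Lifecycle | Description |",
        "| --- | --- | --- | --- | --- |",
    ]
    for f in sorted(flags, key=lambda f: (f.get("category", ""), f["name"])):
        rows.append(
            f"| `{f['name']}` | {f['default']} | {f.get('category', '')} "
            f"| {f.get('lifecycle', '')} | {f.get('description', '')} |"
        )
    return "\n".join(rows)


if __name__ == "__main__":
    print(render_markdown(parse_flags(CONFIG_SOURCE)))
```

The point of the design is that the config file stays the single source of truth: a contributor edits the flag and its comments in one place, and the page is regenerated deterministically at build time.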
So wham bam, Docusaurus magic, 09:27.520 --> 09:32.400 lots of plugins, and all of a sudden you've got this very interactive stuff with code samples, 09:33.200 --> 09:38.720 all the response objects from your API, the parameters, all the good stuff developers actually need 09:38.720 --> 09:46.960 in a very interactive playground sort of place. Now databases are something that we care a lot about. 09:47.120 --> 09:55.760 Superset connects to a whole bunch of stuff, and what it does as a product is essentially just 09:55.760 --> 10:01.040 use some translation layers to send SQL to them from the database. You've got a SQL 10:01.040 --> 10:05.600 library, you're writing the queries, and then you get data back and then we visualize it. It's 10:05.600 --> 10:10.720 actually pretty straightforward when you oversimplify it like that. But in the actual stack, 10:10.720 --> 10:14.960 there's this top layer called the DB engine spec that sits on top of SQLAlchemy dialects, 10:14.960 --> 10:20.080 and that's where we do the stuff that Superset cares about, like documenting time 10:20.080 --> 10:26.000 granularities and other little peculiarities of databases that make them special and make them work 10:26.000 --> 10:33.680 with Apache Superset. So this is our old documentation. This was hand-edited stuff on the docs 10:33.680 --> 10:40.400 site until just a few days ago, honestly. It was out of date, some of these connection details 10:40.480 --> 10:45.520 were incorrect, and nobody could ever answer the question of how many databases we support, 10:45.520 --> 10:52.640 and nobody wants to go and clean up these docs. So we also had this logo wall on the home page of 10:52.640 --> 10:57.440 the site, and it's just like, how did we pick these databases to have logos on the logo wall? Why are these 10:57.440 --> 11:05.840 ones important? 
So what I did is I went through and added, with AI, a bunch of metadata to 11:05.840 --> 11:14.240 every one of the DB engine spec files that we use in Superset. And that means we can all of a sudden 11:15.040 --> 11:21.360 also take advantage of these DB engine spec details about time grains and other features and 11:21.360 --> 11:26.960 all of the custom error messages that they respond with. All of a sudden you get this lovely index page 11:26.960 --> 11:32.000 that tells you exactly how many databases you support, and you can search and you can sort by the 11:32.000 --> 11:36.720 type they are or what features they support and all kinds of other stuff. So you get this lovely 11:36.720 --> 11:42.320 table where, if you're looking for your database, you can figure out which one might be a good fit for you. 11:43.280 --> 11:47.280 All of a sudden you get these great documents that have truthful and up-to-date 11:48.000 --> 11:54.320 information on how to connect to them and even what all their little errors and peculiarities are. 11:56.240 --> 12:01.600 The newest one I just merged quite recently is about React Storybook. I don't know how 12:01.680 --> 12:08.080 many front-end developers are here. But in our particular product we've got a Python back end, 12:08.080 --> 12:14.160 a React front end, and we've got a million React components, many of which are based on AntD but also 12:14.160 --> 12:19.920 several other libraries. We've had this React Storybook, like so many people have seen, sitting there 12:20.640 --> 12:24.960 collecting dust, so to speak. Nobody ever actually does npm run storybook to see what your 12:24.960 --> 12:28.960 components do. They just kind of go into the code and figure out what they could do the best they can. 12:29.680 --> 12:36.800 So if nobody's going to leverage it, might as well build it into the docs. It turns out 12:36.800 --> 12:42.240 there's a whole bunch of plugins you can use to make this fancy and build it into Docusaurus. 
12:42.240 --> 12:45.520 You just have to have AI go and update all the little story files. 12:47.040 --> 12:53.280 Then you have fully interactive examples just like Storybook, but even better, you get this live 12:53.280 --> 12:58.480 code editor. You can't do that in Storybook, as far as I've seen: just type some code and 12:58.480 --> 13:02.960 fill it with your components. You get all the props and everything you need to know, how to import it, 13:02.960 --> 13:09.120 and even links to edit the documentation or the story itself. Then of course the best thing you 13:09.120 --> 13:14.880 can do for open source is tell the world you use it. So we have this "in the wild" page, which used to 13:14.880 --> 13:20.400 be a Markdown file. Nobody knew it existed, therefore nobody updated it. But why would you update it 13:20.400 --> 13:29.280 if nobody can find it? So I changed it from a Markdown file to a YAML file with the help of AI 13:29.280 --> 13:34.400 and then a little Docusaurus magic, and all of a sudden we have this new "in the wild" page where 13:34.400 --> 13:41.680 you can slap some logos on it. You get the little user faces from GitHub, and it gives it a very 13:41.680 --> 13:50.240 high-profile page on the website, and you even get a little crawling logo wall on the front 13:50.640 --> 13:57.760 page as well. So that's a nice bonus. Then this is one I'm halfway through right now, which I'm 13:57.760 --> 14:02.800 dying to finish: screenshot updates. Everybody has screenshots of their stuff in their docs, and 14:02.800 --> 14:09.600 they're such a pain to update because you're constantly changing the UI on things. So we're using 14:09.600 --> 14:16.240 Playwright to test stuff in Superset right now, and it turns out Playwright can take screenshots. 
So if you 14:16.320 --> 14:23.680 actually find the right part of the DOM on your site to take a screenshot of, at the right time and 14:23.680 --> 14:30.240 in the right state, you can just have the script run and take screenshots of all the things you need. 14:30.240 --> 14:35.680 Copy the files into your Docusaurus site and then your screenshots will always be correct. And 14:35.680 --> 14:40.800 we've added versioning to all of our sections of the docs. So whenever you cut a new version, 14:40.800 --> 14:45.360 it copies all those old files over and they'll be locked at the right place in time. And then as 14:45.360 --> 14:53.840 you keep changing things, your next version will always be up-to-date. So speaking of next, what are we 14:53.840 --> 15:00.160 doing? Superset now supports theming, so you can make the product look like whatever you want, 15:00.160 --> 15:05.360 look like your brand, great for embedded analytics and all sorts of purposes. So documenting those, 15:05.360 --> 15:10.320 creating a playground, showing how to build them, and leaning on all the libraries we're built around so 15:10.320 --> 15:14.160 that all of that documentation builds itself even when we upgrade all of these foundational 15:14.160 --> 15:22.080 packages, that can be done. We've got this extension effort, which is kind of a big deal. We've 15:22.080 --> 15:27.120 taken a lot of inspiration from VS Code, where you can add plugins anywhere that do anything, and we're 15:27.120 --> 15:32.480 actually kind of riffing on their architectural plan of how that works. So you can add a bunch of 15:32.480 --> 15:37.360 bells and whistles in a bunch of different places in Apache Superset. That's coming, so that's why we 15:37.360 --> 15:45.360 built this new developer portal. And the extensions are starting to happen. So these docs are kind 15:45.360 --> 15:52.560 of half human written, half AI written, but the real meat and potatoes of it for automation's 15:52.560 --> 15:57.920 sake is actually the extensions themselves. 
People are publishing them on npm, and right now we have 15:57.920 --> 16:03.920 this little Markdown table we're building because it's all very new. But obviously the extensions, 16:03.920 --> 16:08.480 when they get loaded, have a JSON file kind of like a package.json, and we can put as much 16:08.480 --> 16:13.520 metadata in there as we want, including your screenshots and descriptions and compatibility matrix 16:13.520 --> 16:18.880 and whatever other licensing and security details we start to care about, and then this page, which 16:18.880 --> 16:24.320 is right now hand-edited, will go away and be automatic. So as the ecosystem builds and suddenly 16:24.320 --> 16:29.440 we go from 10 extensions to thousands of them, it's all just going to show up there and be up 16:29.520 --> 16:41.760 to date all the time. So yeah, AI. This is where open source has a huge advantage. It's almost 16:41.760 --> 16:50.720 unfair, really. Open source is the best substrate for using or training AI. Everything about 16:50.720 --> 16:55.920 your project, the people, the code, the design patterns, the history, the arguments that happen on 16:55.920 --> 17:00.400 GitHub, all of that stuff has been just sitting there on the internet and they're drinking it up. 17:01.200 --> 17:08.240 So AI knows everything about you, and now your job is to make the documentation and the public-facing 17:08.240 --> 17:15.680 stuff regarding your project as comprehensive as possible so that the next training run will 17:15.680 --> 17:22.080 include more of it and be more useful to people. So you've got to help humans, they need to know where to find 17:22.160 --> 17:29.040 things, but you know, you've got to make sure that things are always current for them, and 17:30.240 --> 17:41.120 the goal is to not have to maintain as much as the code base grows. 
So we have a million 17:41.120 --> 17:46.800 little helpers on our repo right now. Because we're open source, all these people are basically 17:46.800 --> 17:52.800 donating their service to us for free, which is fantastic, and you have an AI chat on the home page 17:52.800 --> 17:57.440 itself, which is very good, but all of these things are actually training on the docs site and 17:57.440 --> 18:03.040 fine-tuning constantly, so the more we update it, the more they know. So this stuff is already helping, 18:03.040 --> 18:07.360 and by the way, there's a talk tomorrow with the GitHub thing if anybody's coming to that, where I'll 18:07.360 --> 18:11.760 be talking about these guys. They're actually starting to talk to each other, which is a total trip. 18:11.760 --> 18:22.400 So yeah, in conclusion here, I guess the point of this story is to not let AI just run away 18:22.400 --> 18:28.720 and write a billion words about your product. That doesn't really do any service for AI that's 18:28.720 --> 18:32.880 going to train on that later. It doesn't really do any service for your users that are trying to 18:32.880 --> 18:40.400 read it and find the important parts and the nuance and the details. So yeah, use AI, but use it to 18:40.400 --> 18:46.960 write code and use it to change your code so it can build the documentation, and just, you know, 18:47.920 --> 18:57.200 don't be too lazy about that. So ultimately, if you put in the work and do these migrations, 18:57.920 --> 19:05.120 the docs will build themselves more and more and your life will become easier. So that essentially 19:05.200 --> 19:12.720 is the crux of my talk. I've left some time, so if anybody has ideas, questions, whatever, I would love to 19:12.720 --> 19:18.320 hear about it. I've also got stickers for anybody who wants some. 19:20.720 --> 19:24.960 Cool. Five minutes if anybody's got burning questions or ideas or whatever. 19:35.680 --> 19:42.560 This kind of docs rebuilding project. 
To the moment you had something you could show to the world, 19:42.560 --> 19:48.320 how long did it approximately take? You can publish something in one day. Some parts were harder 19:48.320 --> 19:54.800 than others. Doing the "in the wild" page, where you could just see all the faces and logos all of a sudden, 19:54.800 --> 20:01.040 took a couple hours. But doing the Storybook thing, where we have hundreds and hundreds of stories 20:01.040 --> 20:07.120 and they all need to be refactored in some way, that was a lot of, you know, monkey in the middle 20:07.120 --> 20:12.640 testing and "nope, that didn't fix it, nope, that didn't fix it" stuff. So it depends, but you can just 20:12.640 --> 20:17.760 chip away at it. Those were a handful of projects that I wanted to start with, but 20:17.760 --> 20:23.680 there's plenty left and I'll just try to make as much of it as generated as possible in the near future. 20:31.040 --> 20:39.360 So far the feedback from the developer community has been... The question was about the feedback 20:39.360 --> 20:45.520 from the developer community: was there pushback or general acceptance or excitement? And so far 20:45.520 --> 20:51.040 it's been excitement. We have answered questions we didn't have answers to before about 20:51.040 --> 20:59.440 what databases we support, how many of them, all that stuff. The optics in terms of partnership 20:59.520 --> 21:03.520 have gotten better because now we're surfacing logos for all of these different companies and 21:03.520 --> 21:09.920 different databases. They get a link. They get better SEO. We get better SEO because people are 21:09.920 --> 21:14.800 able to search for these things. Does Superset connect to database X? Well, yes it does. The answer is 21:14.800 --> 21:21.840 there. So being that much more comprehensive makes us much more findable as a project. 
21:22.800 --> 21:29.360 It makes the site more comprehensive and pretty, and you know, it's almost like the old 21:30.240 --> 21:34.960 web rings. We just link to everything now. They link back to us. It's a virtuous cycle 21:34.960 --> 21:38.800 and it's all growing very quickly. But the developers are stoked because nobody has to maintain 21:38.800 --> 21:43.840 database docs anymore. Nobody has to maintain more and more parts of the docs. 21:44.880 --> 21:47.840 It's taking care of itself. Anyone else? 21:47.920 --> 21:54.480 I don't. You showed it somewhere where like to find out which database you can connect to 21:54.480 --> 22:01.840 the app. It's made up of it. Mm-hmm. And then make it discoverable in some form, right? 22:03.440 --> 22:10.240 It looked like it was stuck strictly from what you stated, but they're like they do 22:11.200 --> 22:19.040 languages. How actually did it? Is it like a study? An analysis over like a day? 22:20.800 --> 22:23.040 Okay. Yeah. Yeah. How has it actually worked? 22:23.040 --> 22:31.040 Yeah. And it's a link about for instance. Yeah. So the question was kind of about how 22:31.040 --> 22:36.240 it works. Like, I saw some TypeScript, there's also some Python. So what languages 22:36.240 --> 22:41.760 go in and out? How does this transformation and build process work? And when you start up 22:41.760 --> 22:49.040 Docusaurus, it gives you the ability to just run a bunch of scripts along with it. And so 22:50.320 --> 22:57.120 half our code base is in Python, half of it's in TypeScript, running back in. So it can really 22:57.120 --> 23:04.160 merge any of that stuff. It's pulling in YAML files and JSON files and Python files and all kinds of 23:04.160 --> 23:10.480 stuff. And it's just a collection of little scripts. So for the databases, the DB engine specs, 23:10.480 --> 23:18.640 there's metadata in a Python file. For the "in the wild" page, it's YAML. For maps, it's TypeScript. 
23:18.640 --> 23:23.440 And each one of these little scripts has this job to just chew through all the metadata 23:23.440 --> 23:29.200 files, build an index, build the individual pages, put all the links and logos and everything in the 23:29.280 --> 23:39.280 right place. So it just runs all of them in sequence. How do you set up the review process? 23:39.280 --> 23:46.080 Like every AIDOM, you have like a review rounds? So yeah, the question was about the review process 23:46.080 --> 23:57.760 and how we manage that. And really the pull request has a preview build. We have a bot on the 23:57.840 --> 24:03.360 pull request. You actually get a preview build of the site. So you can just click it and click around on it. 24:03.360 --> 24:10.720 And I'll be adding some visual regression testing as well. But there's no AI involvement in these 24:10.720 --> 24:16.880 when it builds, actually, because the contributor just edits some metadata and then what happens is 24:16.880 --> 24:24.480 deterministic. So the AI, I guess the crux of the argument is that you shouldn't let AI be building your 24:24.480 --> 24:30.240 docs. It doesn't in our case. It's really just that we're using AI to build more and more tools 24:30.240 --> 24:34.480 so that deterministically the docs are scripted to build themselves from the code base. 24:36.480 --> 24:41.360 Yeah, a bunch of import and layout builder scripts and all of that. 24:42.640 --> 24:45.040 All right, I think that's my time. Thank you all very much. 24:45.040 --> 24:53.600 Thank you very much. Thank you.
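Editor's note on the build process described in the Q&A: the "collection of little scripts" pattern, where each script chews through metadata, writes an index plus individual pages, and all of them run in sequence, could be sketched roughly like this. It is an illustration under stated assumptions, not the project's actual tooling. The `DATABASES` records, the field names, the output layout, and the function names `build_database_pages` and `run_all` are all hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical metadata records. In the talk, this kind of information lives
# in the DB engine spec files; the names and fields here are made up.
DATABASES = [
    {"name": "Example DB Two", "slug": "example-db-two",
     "category": "embedded", "time_grains": ["day"]},
    {"name": "Example DB One", "slug": "example-db-one",
     "category": "warehouse", "time_grains": ["day", "month"]},
]


def build_database_pages(out_dir: Path) -> list[Path]:
    """Write one page per database plus an index that links to all of them."""
    db_dir = out_dir / "databases"
    db_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    rows: list[str] = []
    # Sorting keeps the output deterministic regardless of input order.
    for db in sorted(DATABASES, key=lambda d: d["name"]):
        grains = ", ".join(db["time_grains"])
        page = db_dir / f"{db['slug']}.md"
        page.write_text(
            f"# {db['name']}\n\nCategory: {db['category']}\n\n"
            f"Time grains: {grains}\n",
            encoding="utf-8",
        )
        written.append(page)
        rows.append(
            f"| [{db['name']}]({db['slug']}.md) | {db['category']} | {grains} |"
        )
    header = [
        f"# Supported databases ({len(rows)})",
        "",
        "| Database | Category | Time grains |",
        "| --- | --- | --- |",
    ]
    index = db_dir / "index.md"
    index.write_text("\n".join(header + rows) + "\n", encoding="utf-8")
    written.append(index)
    return written


# Each generator owns one section of the docs; more would be appended here.
GENERATORS = [build_database_pages]


def run_all(out_dir: Path) -> list[Path]:
    """Run every generator in sequence and report what was written."""
    written: list[Path] = []
    for generate in GENERATORS:
        written.extend(generate(out_dir))
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in run_all(Path(tmp)):
            print(path.relative_to(tmp))
```

Because the generators are plain functions over checked-in metadata, the same input always produces the same pages, which is what makes a preview build on the pull request a sufficient review step.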