All right, for the next presentation we've got Vladimiro, talking about lightweight XDP profiling. So take it away.

Hello, everyone. Can you hear me? Yes? Perfect. I'm Vladimiro, a PhD student at Sapienza University, and I'm actually here on a visiting period with Professor Tom Barbette. Speaking of Professor Tom Barbette: to pique your interest, there is actually an open postdoc position, so if you do networking, SmartNICs, and high-speed packet processing, please do contact him.

Otherwise, I'm here to talk about Inspector, which is a lightweight profiling system for XDP applications that outperforms existing eBPF profilers in both efficiency and profiling accuracy.

While eBPF is often used to monitor other programs and for kernel tracing and debugging, the suite of tools capable of profiling the kernel part of an eBPF application is limited, with perf and bpftool playing a key role among them. We evaluated five different XDP applications with and without profilers attached and measured their packet processing rate. As we can see, most prominently in the drop application, which is a simple application that just drops every packet, the throughput goes from 15 million packets per second to less than 4 million, which is a steep drop in performance. This overhead can also make fast network functions hard to profile: not only dummy ones like drop, but also real ones such as the counting sketch, the NAT, and others become more challenging to profile.

But how do profilers work? They rely on a set of specialized hardware registers called performance monitoring counters (PMCs) to track the different hardware events happening in the system: most prominently retired instructions, cycles, cache hits and misses, and many others. To profile the kernel part of an eBPF application, perf and bpftool attach two programs, an fentry and an fexit, around the target program to read the PMC values before and after; then they compute the difference and store the result for user space to read. These fentry and fexit hooks, although very fast, and probably the fastest way to wrap around a program, are still computationally expensive and introduce a significant overhead, as we saw before. This profiling overhead comes mainly from the fentry and fexit functions, but also from the bpf_perf_event_read_value() helper, which is the one perf uses to gather data from the PMCs and which is pretty slow.
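For reference, the fentry/fexit approach used by perf and bpftool looks roughly like the sketch below. This is a minimal illustration under a few assumptions, not their actual source: the "events" perf-event array is assumed to be populated by user space with one hardware-counter file descriptor per CPU, and "xdp_prog_func" is a hypothetical name for the XDP program being wrapped (the real tools pick the attach target at load time).

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* one hardware-counter fd per CPU, installed by user space */
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } events SEC(".maps");

    /* PMC snapshot taken at fentry, one slot per CPU */
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, struct bpf_perf_event_value);
    } start SEC(".maps");

    /* accumulated per-CPU difference, read back by user space */
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } readings SEC(".maps");

    SEC("fentry/xdp_prog_func")
    int BPF_PROG(prof_enter)
    {
        __u32 key = 0;
        struct bpf_perf_event_value *v = bpf_map_lookup_elem(&start, &key);

        /* snapshot the counter before the XDP program runs (slow helper) */
        if (v)
            bpf_perf_event_read_value(&events, BPF_F_CURRENT_CPU, v, sizeof(*v));
        return 0;
    }

    SEC("fexit/xdp_prog_func")
    int BPF_PROG(prof_exit)
    {
        __u32 key = 0;
        struct bpf_perf_event_value end = {};
        struct bpf_perf_event_value *s = bpf_map_lookup_elem(&start, &key);
        __u64 *acc = bpf_map_lookup_elem(&readings, &key);

        if (!s || !acc)
            return 0;
        /* read the counter again after the XDP program returns */
        bpf_perf_event_read_value(&events, BPF_F_CURRENT_CPU, &end, sizeof(end));
        /* the difference also contains the profiler's own instructions,
         * which is the source of the inaccuracy discussed below */
        *acc += end.counter - s->counter;
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";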
These profiling functions get called, for an XDP application, millions of times per second, which ends up drastically degrading the throughput. And not only is the throughput disrupted, but also the profiling accuracy: as we can see, the drop application should take only around two instructions, setting the action to drop and returning, maybe a bit more, yet perf, when analyzing the drop program, reports that it executed 627 instructions. This is because the perf profiler also counts some of its own instructions, the ones used to profile the program before the second call to the read-value helper; so, when computing the difference, it adds some of its own instructions to the total.

To solve this problem we developed Inspector, a lightweight XDP profiler with three main components: a user-space component that does the setup and tells the kernel part which CPU events to record; two tracing macros that delimit the profiling section, the part you are actually interested in; and a kernel module, called from the tracing macros instead of the perf helper, to read the PMC values efficiently.

These are the tracing macros, start_trace and end_trace. They can be placed anywhere inside any XDP program, wherever it is useful, and can be used to profile blocks of code or even individual instructions. Both macros read the PMC values and store them, but start_trace also manages the activation, if you actually want to profile that region, and the sampling rate that we will talk about in a few minutes. end_trace instead computes the difference with respect to start_trace and stores the result in a BPF map for user space.

But how do we access these values? eBPF has a limited instruction set and does not allow the use of native x86 instructions such as RDPMC, which are needed to read the PMC values fast and efficiently. This is why perf has to use the bpf_perf_event_read_value() helper to access them. To overcome this limitation, we developed a Linux kernel module that exposes a kfunc that executes RDPMC and returns the value from the counter itself.

So, by calling the macros directly from the XDP program, we remove the need for fentry and fexit, which saves us about 200 instructions per call. Then, directly accessing the PMC through the kfunc saves us another 200 or so instructions, because we are not calling bpf_perf_event_read_value().
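In code, the Inspector approach boils down to something like the sketch below. It is a minimal illustration, not the tool's actual implementation: bpf_read_pmc() is a hypothetical name for the kfunc exported by the kernel module (registered on the kernel side with register_btf_kfunc_id_set() and wrapping RDPMC), and the macro bodies and map layout are illustrative.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>

    /* hypothetical kfunc exported by the kernel module: wraps RDPMC */
    extern __u64 bpf_read_pmc(__u32 counter) __ksym;

    /* per-CPU result map read by the user-space component */
    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } deltas SEC(".maps");

    #define PMC_INSTRUCTIONS 0   /* counter index programmed by user space */

    /* delimit the start of the profiling section: snapshot the PMC */
    #define start_trace() __u64 __t0 = bpf_read_pmc(PMC_INSTRUCTIONS)

    /* end of the section: read the PMC again and export the difference */
    #define end_trace()                                              \
        do {                                                         \
            __u64 __t1 = bpf_read_pmc(PMC_INSTRUCTIONS);             \
            __u32 __k = 0;                                           \
            __u64 *__d = bpf_map_lookup_elem(&deltas, &__k);         \
            if (__d)                                                 \
                *__d += __t1 - __t0;                                 \
        } while (0)

    SEC("xdp")
    int xdp_drop_prog(struct xdp_md *ctx)
    {
        start_trace();
        /* region of interest: here just the drop action */
        int action = XDP_DROP;
        end_trace();
        return action;
    }

    char LICENSE[] SEC("license") = "GPL";

The user-space component would then program the hardware counters and periodically read the per-CPU deltas map.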
As we can see in this graph, on the left we have the regular perf profiling, and on the right we have Inspector, where we still have around 40 instructions for each macro, mainly due to the call to the kfunc.

These are some of the programs we used to evaluate our profiler. Drop is the dummy application that simply drops everything. The counting sketch is used for monitoring flow traffic and stores data inside a pretty big map. The NAT, which translates addresses, is pretty similar to the tunnel, which does some encapsulation, and the router, which is the obvious one: it simply looks up a pretty huge LPM trie to get the routing information.

So, as I told you a little bit before, there is some profiling inaccuracy when the profiler is pretty heavy, like perf and bpftool, because they add some of their own instructions to the instruction count. In this graph we can see the retired instructions that each profiler says the application is composed of, or is actually running. All of these should be the same, but they clearly are not, because perf and bpftool are pretty heavy and add more noise to the profiling result. For example, take the drop application, the simple one we showed you before: we expect around 2 instructions, perf says around 600, and we say about 40, so we are not perfect ourselves. This inconsistency and inaccuracy also shows up when profiling other metrics, such as cache misses and hits, because these events can be caused by the profiler itself rather than by the application under test.

These are the throughput results with our applications attached to different profilers, compared to the baseline. Here we can see that our profiler, Inspector, is quite a bit better than the other profilers, but it still hurts the performance of most XDP applications. The worst case is drop because, since it is a very lightweight program, the weight of the profiler is proportionally high. To mitigate this problem, we implemented a sampling functionality that increases performance while maintaining good-enough results, let's say. Our sampling mechanism is composed of a simple counter that checks whether the packet falls in the current sampling period. If it does, we read the PMCs, compute the results, store them, and do all the regular stuff. To compare somewhat fairly against perf, we also implemented a similar functionality inside perf; this sampling functionality was not supported, and is still not supported, in perf at the time of doing this work.
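The sampling mechanism just described amounts to something like the sketch below, reusing the hypothetical start_trace()/end_trace() macros from the earlier sketch; the counter handling and the sampling period are illustrative, not the tool's actual code.

    #define SAMPLE_EVERY 64      /* profile one packet out of 64 */

    /* global packet counter; occasional races across CPUs are
     * acceptable for sampling purposes */
    static __u32 pkt_counter;

    SEC("xdp")
    int xdp_drop_sampled(struct xdp_md *ctx)
    {
        if ((pkt_counter++ % SAMPLE_EVERY) != 0)
            return XDP_DROP;     /* fast path: no PMC reads at all */

        start_trace();           /* sampled packet: read the PMCs */
        int action = XDP_DROP;
        end_trace();
        return action;
    }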
For perf, however, the performance gains from sampling are limited, because it still has to call the fentry to check whether the packet is in the sampling period. This call to the fentry is still pretty expensive, even if you are not calling the read-value helper to get the values. Instead, since we do this check inside the macro, which is pretty lightweight and fast, we get much better results, almost reaching the baseline with no profiling attached. If we sample every 64 packets, let's say, we reach almost the performance of a non-profiled application while maintaining good-enough results and accuracy. In this case we are counting L1 cache misses, and we get good, expected results.

So, to recap a little: during our work we identified the main sources of overhead of the existing profilers, which turned out to be the fentry, the fexit, and the read-value helper. These functions are necessary if you want to profile the application as-is, without modifying it, because you are hooking around the application itself. If you can modify it, or you prefer better performance, you can skip most of these functions by doing the profiling from inside the XDP program and using a kfunc to access the PMCs more efficiently. We also implemented the sampling functionality to further reduce the overhead. It turns out that Inspector, against perf in this case, is 71% faster without any sampling, and 122% faster when comparing sampling against sampling. But more importantly, we get 73% less instruction noise, at least in the tests counting instructions, which is pretty good.

So, thank you very much for your attention.

Thank you. We have time for questions.

Thank you for the talk, very interesting project. My question is: if you can build a profiling mechanism like this for XDP programs, can you adapt it to regular eBPF programs? Is the overhead basically the same, or very different?

Supposedly we could, because the point is being able to call the kfunc, and we should be able to call it from every eBPF program. The thing is that the performance gains are much higher in XDP, because the application gets called millions of times per second. A different eBPF application might not get the same benefits, because it is probably called less often; if it is called the same number of times, it could be useful.

Thank you. By the way, I love that bee flying from the beehive to the flower.

Thank you.
I was wondering: instead of the kfunc, did you also consider implementing that instruction natively, as a one-to-one mapping in BPF, by extending the BPF instruction set? Because, I would expect, that would give you even better performance. And are you planning to submit some of this upstream?

There is the idea of doing that, maybe doing it like the time-getting helpers among the BPF helpers, so you can call one function and get the result without all the infrastructure around it. But of course some of the infrastructure is still needed, so it becomes harder to do something like that.

Have you measured the overhead of the sampling mechanism itself? What is the overhead of the sampling check on its own?

It's pretty low. The overhead, let's see, is this one, depending on the sampling rate you are using. The black line is the baseline, so anything below it is our result, and this is Inspector without sampling, so we gain this much if we sample every...

Is that without sampling, or with sampling one-to-one?

This is without sampling, and this is with sampling every eight packets.

Yes, but I want to know the performance if I add sampling but sample every trace, so that I know the overhead of the sampling mechanism itself.

Ah, okay, no, we didn't do that kind of test.

Thank you.

Hey, thank you for the talk. I was wondering if you know what extra work bpf_perf_event_read_value is doing beyond just calling the native instruction.

It's mainly pretty heavy because it uses file descriptors to access these values, so it is like reading a file, let's say, and that is the main overhead in this function.

Thank you.

Any more questions? I had a question. Are there any changes that could be brought to perf or to BPF to make them faster, based on the work you've done?

Yes and no. The "no" part is because, since we are calling a kfunc, it is specific to our machine, and it would be hard to ask someone else to include a kfunc inside the kernel and to call a kfunc without knowing what it actually does. So that would be the problem.
There could be a way of doing it without a kfunc, by simply calling the perf read-value helper from inside the BPF program instead of calling the kfunc. That way you remove the fentry and fexit overhead. But it is still pretty heavy to call that helper, so the gains would be marginal, let's say.

All right, thank you. Have you run your tool in production? Have you used it so far mostly for experimenting, or have you tried actually running it in production?

No, it was just experiments like this, no actual production testing.

It would be interesting to get some benchmarks on that.

Okay, thank you. Does someone over there have a question? All right, thank you.

Thank you.