So yeah, for those of you who are not familiar with vLLM: vLLM started as an open-source project from a couple of students at Berkeley, and it exploded very quickly because it offered a really nice and efficient implementation of inference for modern LLMs. Over time it became one of the very large and vibrant communities where PhD students, researchers, engineers, and also big tech companies are contributing very actively, trying to develop an inference engine that serves these modern LLMs as efficiently as possible. The cool part about vLLM is that it runs on and supports various hardware: NVIDIA GPUs, AMD GPUs, Google TPUs, AWS Neuron, Intel Gaudi, and so on. For most of these hardware vendors, teams from those companies are helping out, trying to make vLLM run as efficiently as possible on their hardware. On top of that, vLLM supports running all of the popular modern LLMs, and model vendors usually work with vLLM behind the scenes to enable day-zero support for their models when they are about to be released. So vLLM offers a ton of optimizations by default, and you will already get pretty good performance if you just deploy your model with the standard configuration.
However, there is always the traditional question of whether we can squeeze out more performance, and this is where quantization and speculative decoding became very attractive topics for us, because they offer an additional pathway, completely orthogonal to all of the optimizations that already exist in vLLM, to accelerate inference with these very large models.

So, just to bring everyone onto the same page, one slide on what quantization is all about. Quantization is a process through which we reduce the number of bits that we use to represent either the weights or the activations of the model. What I mean by this is that if we take a neural network and plot all of the weights in the network, we are going to get a Gaussian-like curve. The weights are distributed across this curve and can take any possible value here. Given that this is full precision, and by full precision these days we usually mean the weights are represented in bfloat16, which has become the standard, we have very high granularity, so we can represent very, very small differences between weights. Through the process of quantization, we take these weights and put them into discrete buckets, and each of these buckets corresponds to one single value that we can represent in the new quantized range.
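As a toy illustration of these buckets, here is a minimal sketch of symmetric uniform quantization in plain Python. It is not any particular production scheme (real FP8 or group-wise INT4 schemes differ in detail); it just shows the round-to-the-nearest-bucket idea.

```python
def quantize(weights, num_bits=8):
    """Map floats onto the integer grid [-qmax, qmax] (symmetric,
    per-tensor). Real schemes differ in detail, but the bucketing
    idea is the same."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]   # pick a discrete bucket
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.012, 0.013, -0.41, 0.0009]
q, scale = quantize(weights, num_bits=4)      # only 15 buckets
recovered = dequantize(q, scale)
# 0.012 and 0.013 fall into the same bucket, and the single large
# value -0.41 stretches the scale so the small weights collapse to 0.
```

Note how one extreme value stretches the scale so that the small values all land in the zero bucket; this is exactly the outlier effect that makes quantization hard.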
Because this quantized range has much, much lower granularity, during this process we are going to have to shift the weights a bit. For example, if two weights were very close to each other in the original full-precision regime, most likely we won't be able to represent their difference in the quantized range, so we will have to put them in the same bucket. Some weights we will have to move to the left or to the right, and so on. All of this is to say that quantization is not a lossless process. There will be some loss of precision, and the entire game, the entire research here, is about figuring out algorithms that enable us to do this while preserving the accuracy of the model as much as possible, because we still don't want to destroy our model in order to get some efficiency gains.

So let's take a look at where quantization fits inside the server, and based on that we'll try to figure out which quantization scheme to use in which situation. If we look at our server, at the very bottom of this pyramid we have CPU memory. This is our main memory: it's very large, but it's also very, very slow. On top of that we have our GPUs. GPUs have something called high-bandwidth memory, or HBM, and this is what you see when you run nvidia-smi. The main characteristic of this layer is that it's much smaller, but also much faster, than the CPU memory.
Then at the very top we have GPU SRAM and the tensor cores. This is the smallest part of the memory, but it's also the fastest one, and this is where the computation happens: this is where the matrix multiplications actually happen, and this is how we do inference with our model. So let's look at what happens during one forward pass of the model, where we put some tokens into the model and generate an answer. The first part is that we have to load the model: we move it from CPU memory to HBM, and in most normal scenarios we do this only once. Then, n times, for every single operator in the network, we take one weight matrix, load it from HBM into the tensor cores, do some computation with it, write the results back, and repeat. This is the sequence of operations that we repeat many, many times, for every single token and every single layer in the network. The reason I'm specifically calling out loading a torch.nn.Linear here is that even though LLMs these days have many different types of layers, the linear layers, or linear operators, are the costly pieces of this.
If we were to plot where we spend time during the forward pass, the linear operators, that is, the self-attention and MLP parts, take the majority of the time, and they are the main target for optimization in general. So these are the two main components and candidates that we actually want to accelerate through quantization. The first part is acceleration of the loading. Given that the loading happens when we transfer the weights from HBM into the tensor cores, there is nothing we can do on the hardware level, because every GPU comes with a fixed bandwidth; there is nothing we can change there. The only thing we can do is reduce the number of bits to load, and those bits are the bits of our weights. That's the first quantization scheme we work with, called weight-only quantization. The second part is the computational part, which happens inside the tensor cores; this is where the matrix-matrix multiplication operations happen. The only way to accelerate that is to find faster tensor cores. If we take a look at the technical specifications of an H100 GPU, we'll see that by default, because most models are in bfloat16 format, we are operating in the bfloat16 tensor cores, which give you around 2,000 teraFLOPS. And if we look at the spec sheet, there are also two more kinds of tensor cores, the FP8 and INT8 tensor cores, which offer two times more FLOPS per unit of time.
So the main goal of this part is to push all of the matrix-matrix multiplications to happen inside the FP8 or INT8 tensor cores instead of the bfloat16 ones, because then we get two times more FLOPS per unit of time. The way to do this is to have both operands of the matrix-matrix multiplication quantized to either FP8 or INT8, and this is where the second scheme comes in, called weight-and-activation quantization. These are the two main paradigms in the quantization field. Whenever we want to quantize and optimize a model, we have to pick one of these two, and that depends on the exact scenario we're interested in; we'll see later how to choose.

Now I'm going to very quickly go through some code samples, just to showcase how relatively simple this is at this point in time. As part of the vLLM project we have a library called LLM Compressor, which implements a lot of state-of-the-art quantization algorithms that you can use from a Python interface. And it's relatively simple to apply. You just load your model, with standard Hugging Face Transformers, via .from_pretrained(). Then we instantiate the quantization modifier: we say, okay, we want to target all linear layers, because as I mentioned before, those are the costly pieces of inference.
And the scheme we want for quantization in this specific case is, let's say, the FP8 dynamic scheme. We call the oneshot method, provide the model and the recipe, and that's it. This is a very simple pipeline where we don't use any calibration data.

So here we come to the second choice we have to make when doing quantization: whether to quantize with or without a calibration dataset. A calibration dataset is a set of tokens that we pass through our model to simulate the forward pass and see how the model responds. LLMs these days have one major problem, and that problem is outliers in the activations. Outliers in activations are a phenomenon where some specific channels have values an order of magnitude higher than the other activations in the same layer. This makes quantization very hard, because if we have one extremely large value and a lot of very small values, and we quantize on top of that, the very large value is going to push all the small ones to zero, and if we do that, we destroy the accuracy. So we have to address this, and the path to get there is to take some calibration dataset, pass it through the model, see where these outliers appear in the model, and then based on that apply some tricks.
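In code, the data-free FP8 pipeline just described looks roughly like the following. This is a sketch based on the LLM Compressor examples; the model name is only an example, and the exact import paths and signatures should be double-checked against the library's documentation.

```python
from transformers import AutoModelForCausalLM
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the model the standard Hugging Face way (example model name).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Target all linear layers with the FP8 dynamic scheme: FP8 weights,
# activations quantized dynamically at runtime, no calibration data.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)
model.save_pretrained("Llama-3.1-8B-Instruct-FP8-dynamic")
```

The saved checkpoint can then be served with vLLM as usual.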
And this requires some kind of calibration dataset; the example here shows how relatively easy it is to do quantization with LLM Compressor while also accounting for the fact that we want a calibration dataset. First we load the model, the standard way we load any Hugging Face Transformers model. Then we use a calibration dataset that, based on many years of research, we have already prepared. Usually, if you have a model fine-tuned for a specific task, you want to pick a calibration dataset from that task so that you have in-distribution tokens; if you're just quantizing a general model, then this should be a good enough dataset to start with. We tokenize it, and then we create a quantization recipe. Here I'm trying to show you one nice feature of LLM Compressor: you can combine multiple different quantization algorithms together. I'm throwing in SmoothQuant, which is an algorithm that deals with the outliers in the activations, and then the standard quantization modifier: I want all linear layers, but now I've changed the quantization scheme to quantize the weights to 8 bits and the activations to 8 bits. And I can also ignore some specific layers if I really want. One common practice is to skip quantization of the final lm_head, because quantizing it usually has a significant impact on the accuracy of the model.
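A sketch of this calibrated W8A8 recipe, again based on the LLM Compressor examples: the model name, calibration dataset, and hyperparameters below are illustrative choices, not the only valid ones, and the API details should be verified against the library's documentation.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # example model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# A general-purpose calibration set; for a task-specific model you would
# sample in-distribution tokens from that task instead.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft[:512]")
ds = ds.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})

recipe = [
    # SmoothQuant migrates activation outliers into the weights ...
    SmoothQuantModifier(smoothing_strength=0.8),
    # ... then weights and activations are both quantized to 8 bits.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)
model.save_pretrained("Llama-3.1-8B-Instruct-W8A8")
```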
Then we call oneshot and provide the model and the recipe, and contrary to the previous example, given that we have a calibration dataset, we also have to provide the dataset and the number of calibration samples. What this is going to do is pass the calibration tokens through the model, look at how the activations behave and where the outliers are, then based on that apply SmoothQuant and the quantization of the linear layers, and provide us with a quantized model. This pathway is computationally more expensive, right, because you need access to GPUs to do the forward passes; the previous one was relatively easy because it did not depend on any calibration dataset, so we could do it even on a CPU. LLM Compressor has a couple of nice built-in features to deal with this, whether you have a single GPU or multiple GPUs: it can do sharding, CPU offloading, and all these modern tricks.

So, given that quantizing a model is, as I mentioned before, a lossy process, the first question we should ask ourselves is how well we are recovering the accuracy of the original model. We have a pretty long line of research where we try to show how to do quantization in different scenarios for different models, but here I've cherry-picked just one example to show what this accuracy recovery looks like in practice.
What we're looking at is the entire family of DeepSeek-R1 distilled models: DeepSeek took Llama and Qwen models and fine-tuned them for reasoning tasks. We are looking at their average performance across AIME, MATH-500, and GPQA Diamond, which are the standard reasoning benchmarks for this. And we are looking at four variants: BF16 is the unquantized baseline, represented by the gray bar, and FP8, INT8, and INT4, represented by the blue, green, and yellow bars, are the different quantization schemes that we apply here. The most important point to take away from this is that the blue and green bars are usually very, very close to, and in most cases almost indistinguishable from, the gray bars, which means that FP8 and INT8 quantization, if done and calibrated properly, should always yield accuracy recovery in the range of 95% to 100% of the full baseline model. One important additional note is that INT4 quantization is usually a little bit trickier in this specific regime, more specifically for the smallest models in the family. As you can see for Qwen 1.5B and Llama 8B, we do see slightly higher drops here, but if the process is done properly, these drops should never go below 90% recovery. At least, we have done a ton of research on this and produced hundreds of models on the Hugging Face Hub, and we still haven't found a single use case where the accuracy recovery goes below 90%.
So if that happens, something should be calibrated a bit better on the quantization side. And all this time I'm talking about Llama, but all of these techniques apply to vision-language models or any other architecture that you find; it's just that Llamas these days are the most representative use case.

So that was accuracy. Given that we are compressing the model, we usually also expect some gains on the speedup side, because now we have a smaller model and we want to run inference with it. Here what we're looking at is how the inter-token latency changes with respect to the number of queries per second, for a Llama 8B model served on a single A6000 GPU, for one specific use case, Docling generation. And we are looking at three different models: BF16, the unquantized baseline; an INT8 weight-and-activation quantized model; and an INT4 weight-only quantized model. There are two interesting things to take away from this graph. The first is that if our server is being hit with fewer than four queries per second, we are in the regime where the best choice, with respect to the smallest latency we can get from the model, is the INT4 weight-only quantization. And this is because in this specific use case we are not bounded by compute: we don't have many requests coming to our model, therefore our GPUs are going to be idle most of the time.
So what happens here is that the main gain we can get comes from optimizing the weight-loading part of the pipeline. Then the next part becomes important: when we hit four queries per second, weight-and-activation quantization starts becoming a better choice than weight-only quantization. This is because at that point we have large enough inputs to keep our GPUs busy doing matrix-matrix multiplications, which means we now need to optimize the computational part, the second part of the inference pipeline. We can optimize that by quantizing both operands of the matmul, which means we are leveraging the lower-precision tensor cores. And if we push even more, at some point we get into the heavily compute-bound regime: our inputs are extremely large, and weight-only quantized models become a worse choice than just deploying the unquantized model, because now we are heavily bounded by compute, and the time needed to load the weights is almost zero relative to the time we spend in the tensor cores doing matmuls. So you have to be really careful when you're deploying models, in order to figure out where on this graph your deployment lies. And this depends on all of the factors in the game: the model size, the GPU that you have, and the requests that your server is receiving.
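The crossover between these regimes can be sketched with back-of-the-envelope arithmetic. The bandwidth and FLOPS numbers below are rough, H100-class figures chosen purely for illustration, and the model simply treats one decoding step as the slower of weight loading and matmul compute:

```python
def step_time(params, batch_tokens, bytes_per_weight=2,
              bandwidth=3.35e12, flops=1e15):
    """Crude lower bound for one decoding step: weight traffic vs.
    matmul work, whichever dominates (loads overlap with compute)."""
    load = params * bytes_per_weight / bandwidth   # read every weight once
    compute = 2 * params * batch_tokens / flops    # ~2 FLOPs per weight/token
    return max(load, compute)

p = 8e9  # an 8B-parameter model

# Few concurrent tokens: memory-bound, so INT4 weights (0.5 bytes) help.
small_bf16 = step_time(p, batch_tokens=1)
small_int4 = step_time(p, batch_tokens=1, bytes_per_weight=0.5)

# Huge batch: compute-bound, so weight-only quantization stops helping.
big_bf16 = step_time(p, batch_tokens=4096)
big_int4 = step_time(p, batch_tokens=4096, bytes_per_weight=0.5)
```

With these toy numbers the INT4 model is 4x faster at batch 1, while at 4096 concurrent tokens both variants take the same, compute-dominated, time; accelerating that regime requires faster tensor cores, which is what weight-and-activation quantization buys.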
In order to automate this, because we had to do it many, many times, we developed a library, also part of the vLLM project, called GuideLLM. You can serve your model, simulate real-world workloads, see how the model behaves in different scenarios, and then based on that plot something like this and see which quantization scheme is best for you.

Given that we have five more minutes, I'm going to wrap up with just a very short part on speculative decoding. Speculative decoding is a relatively new technique. Quantization was a lossy process, because we are moving and shifting the weights around; speculative decoding, by contrast, is a lossless acceleration technique, which means that we are not changing the model, and the text that we get at the end of the speculative decoding process is guaranteed to be the same text that we would get without it. The main caveat here is that we have to train an additional model, called the speculator model, which we serve alongside our original model.
This speculator model is an order of magnitude smaller than the very large model we are originally interested in, and we run it many, many times, trying to produce, say, three to five tokens at a time. Then we take our larger model, in this context called the verifier, to verify the outputs of the smaller model. The larger model says: I agree with three out of five tokens, these are the tokens I would have produced in the decoding phase, so I accept these three, I reject these two, and then we go again. In this way, if our speculator model has been trained properly, we get a scheme where we generate multiple tokens at a time with the larger model for the cost of running a couple of forward passes through this very, very small model. I've tried to illustrate it in the figure here. In the first pass we have the prompt "once upon"; the large model starts the inference and produces "a". Then we take this and give it to the speculator model and say: please speculate what the next two tokens should be. It's going to say "time there". Then we take all of these, put them through the verifier, and ask: do you agree with these tokens, are these the tokens you would produce? In this specific case the model says: yes, I would produce these. So we got three tokens for the cost of running just one forward pass of the large model, plus some small amount of time we spent in
the speculator model. Then, having generated three tokens, we pass them again to the speculator and say: generate the next two tokens. The speculator generates these two; the verifier in this case does not agree with "scary", so it just discards the "scary" part, takes the first token, which it does agree with, combines them, and produces the next one. This entire process is great, but there is one important part. Quantization was relatively easy to apply: we take a model, quantize it in one shot, and we have a model which is faster. Here, the main cost we have to pay is that we have to train the speculator model: we take a large model, generate some dataset, and train a smaller speculator model. There are different techniques for training them to mimic the distribution of the larger model, so this is a computationally expensive part, but it does allow us to get lossless speedups. For this purpose we also developed a library in the vLLM project called Speculators, where you can train these models on your own datasets, or you can just go to the Hugging Face Hub and pick up some of the models that we already released there. Given that we don't have time, I'm just going to skip through the results. The cool part is that we can get speedups of anywhere from two to five x, depending on the model size and the quality of the speculator model, but I'm going to skip this so we can maybe take some questions. So yeah, links: all of the libraries are
part of the vLLM project and open source, so you can just play with them; there are standard examples. If you don't want to do any of this and just want high-quality quantized models or high-quality speculator models, you can go to Red Hat's Hugging Face hub. We are releasing new models there on basically a daily basis; there are more than 500 compressed models which have already been validated, so you can see what the accuracy recovery looks like across different use cases. You can just download them and play with them. Thanks a lot for your time.

[Audience question] Okay, so the question is how to tweak the recipe. It really depends on the quantization algorithm that you want to use; every single quantization algorithm has different knobs to tweak. Usually what you would do is like a standard training loop: you take some development set, some small subset that you are not testing on, and run a couple of quantization runs with different hyperparameters, like tuning a training process. Then you evaluate on the development split, pick the best one, and go to the test split. There is some guidance, depending on which quantization algorithm you use, on how to tune some specific pieces; for example, GPTQ, which tries to approximate the inverse Hessians, has a dampening term which has to be
picked on a model-by-model basis. So there is no general rule of thumb that says you should set this like that. There is just a rough rule of thumb on how many tokens you can use for calibration, around 500 to 1K samples; at that point you start seeing negligible improvement in the end results. But apart from that, everything is very specific to the algorithm that you use.

[Audience question] Yes, you always run them in parallel. So the question is how to run speculators in vLLM. In vLLM we have support where you don't have two instances of vLLM; it's a single instance, but inside that single instance you have a speculator model and your large model. And if you do, say, TP=2, as you said, you are splitting the large model: the tensor parallelism refers to your original model, and the speculator still runs on its own, completely independently, because it's a really small model and you don't get any gains by splitting it or doing any fancy sharding with it. So your entire model is still running in the same way; you just attach one more, smaller model, which just runs in parallel. And it's as simple as vllm serve: you give it a speculator model, and based on that it does everything automatically for you; you don't have to do any custom stuff.

[Audience question] So the question is regarding the validation of quantized models: are we in danger of overfitting, for example, to some specific dataset?
Yes, that's a really great question, and it's still an ongoing problem, because usually whenever we quantize a model, we open the model card on Hugging Face and see, okay, Mistral published these ten evaluation benchmarks, so our main task is to do quantization and recover accuracy on the benchmarks that they proposed. That's kind of the standard setup that we usually do, but maybe there is something we are missing along the way, so we are always trying to add new benchmarks to the mix. We still haven't found a single benchmark where the entire story about accuracy recovery being above 90% fails. We have a paper where we did more than a million evaluations across many different benchmarks; it's called "Give Me BF16 or Give Me Death: Accuracy-Performance Trade-Offs in LLM Quantization". It basically presents a large-scale study: we took every single benchmark that exists out there, from Arena-Hard to coding to the Hugging Face leaderboards v1 and v2 and so on, and we still haven't been able to find a single benchmark where this fails.

[Audience question] Okay, so the question is about the relation with llama.cpp. So vLLM supports llama.cpp models in general, I think, but I'm not exactly sure about support for their quantized formats, because llama.cpp has its own way of doing quantization, Q4, Q3, and all these different schemes, and they're all doing weight-only quantization as far as I'm aware. I think vLLM should support them,
but I'm not sure that's a well-tested path at all. I think there is some way to run llama.cpp models, but I don't think it's super performant, at least at this point. I know that there are some people working on it; specifically, within Red Hat there is a new team which is supposed to bring llama.cpp up to be a first-class citizen of vLLM, but I'm not really in touch with that pipeline. We mostly do vLLM on GPUs; that's kind of the main use case these days.

Great.
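As a closing footnote to the speculative decoding part earlier, the draft-and-verify loop can be sketched with toy "models". Here the speculator and the greedy verifier are just stand-in lookup tables over a fixed token sequence (all names are hypothetical); a real verifier would score all draft tokens in a single batched forward pass rather than one call per token.

```python
def speculative_decode(prompt, draft_model, verifier, k=2, max_new=6):
    """Lossless speculation: the output always equals what the verifier
    alone would generate, token for token."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        accepted = []
        for tok in draft_model(tokens, k):        # k cheap draft tokens
            if verifier(tokens + accepted) == tok:
                accepted.append(tok)              # verifier agrees
            else:
                break                             # reject the rest
        # After a rejection (or a full accept) the verifier contributes
        # its own next token, which keeps the process lossless.
        accepted.append(verifier(tokens + accepted))
        tokens += accepted
    return tokens

# Deterministic toy sequence standing in for the model's greedy output.
STORY = ["once", "upon", "a", "time", "there", "was", "a", "story"]
verifier = lambda ts: STORY[len(ts)]              # "true" next token
draft = lambda ts, k: STORY[len(ts):len(ts) + k]  # a perfect speculator

out = speculative_decode(["once", "upon"], draft, verifier, k=2)
```

With a perfect speculator every draft token is accepted, so each round yields k + 1 tokens for one verifier pass; with a bad speculator the loop degrades gracefully to ordinary one-token-at-a-time decoding, never to a wrong output.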