Adaptive Sampling in Path Tracing

Hello. In this post, I will share my journey developing an adaptive sampling system for path tracing. That will not be the only topic, as I will also be making important changes to my current path tracer. Let me start!

Foreword

I chose adaptive sampling as my project topic. My earlier project idea was spectral tracing, since I thought the visual quality of spectrally rendered images would be nice. However, our third homework and the ones that followed changed my view, as I felt that the current bottleneck of my path tracer (and of current tracers in general) is noise.

There are many explored, partially explored, and unexplored ways to improve efficiency in path tracing, and reducing noise is one of them. When we implemented sampling methods in path tracing, the first issue I noticed was that many of the samples were redundant. For example, we may spend a substantial amount of time sending 1000 samples into the background only to get the same value from all of them.

Adaptive sampling is a class of methods that address this kind of redundancy, with or without introducing bias.

Adaptive Sampling


The theory of adaptive sampling is explained on Wikipedia here. Actually, I am the author of that text, up to the beginning of the part about its use in molecular biology.

Its use in path tracing is similar. Let x, y be pixel coordinates, the inputs to a function f whose output is the perfect pixel value of the scene being rendered. The perfect value may depend on the monitor being used, the viewer's taste and experience, the current environmental lighting... However subjective this may be, there is at least a theoretical closed region in this million-plus-dimensional space, inside which the images are guaranteed to be better than the images outside.

In path tracing, what we actually do is try to write estimators that will emulate that function f. We are bottlenecked by computational power. The SPP values in the scenes given to us (and also the dimensions that define a picture) then represent the computational limits we have.

So one may at first feel that sending samples per pixel is not the right approach, since we are limiting our estimator with a constraint. Anyone who thinks this way is both right and wrong. Let me explain. It is correct that constraining a solution almost always (close to 100% of the time) means it is not the best one. However, it is also an abstract truth that, since the images we produce will be viewed on devices with square pixels, the perfect sampler must take pixel positions into account. So pixel boundaries do exist and contribute in the ideal sampling solution.

Adaptive sampling is not only about picking which position in a pixel to sample next; that is in fact probably the most basic variant. I have not seen any method that does this, but in theory, you can even decide to skip some pixels. This makes sense in scenes where half of the pixels are just blank background color. Sampling toward the edges of objects, where most of the frequency content lies, might be enough and might save you a lot of rays, at the cost of an increased risk of missing a normally visible distant object.

You can detach from the pixel grid and sample the whole image as a continuous 5D function [RxR -> RxRxR (red, green, blue)], build a hierarchical subdivision of the image...

Another component of path tracers where adaptive sampling may -or will(?)- help is splitting. Remember that when you split, you uniformly sample the hemisphere? Well, you can apply any type of sampling instead of uniform. You can use stratified sampling, for example, for better dispersion. Or you can decide how much to split by using adaptive sampling, as follows:


Send 2 random samples (maybe stratified, so that you are sure they are apart from each other), then check the difference of the colors (entropy, noise). If the difference is too large (this is the heart of present adaptive sampling techniques; the algorithms usually differ exactly here...), you send another ray. You keep sending rays (splitting) until you are satisfied with the result. Note that this can be done recursively. The satisfaction threshold can be determined using the overall contribution of the rays, the depth (which is related to the former), the total samples left (if this is a second pass), the previous results in nearby areas (implementing this would be very tricky, I believe)...
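The splitting loop above can be sketched as follows. This is only an illustration: `drawSample` is a hypothetical hook standing in for "trace one split ray and shade it", and a single channel is used instead of a full color.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Adaptive splitting sketch: keep drawing samples while the last two
// disagree by more than a threshold, or until the budget runs out.
struct SplitResult { double mean; int count; };

SplitResult adaptiveSplit(const std::function<double()>& drawSample,
                          double threshold, int maxSamples) {
    double a = drawSample();   // the 2 initial samples
    double b = drawSample();
    double sum = a + b;
    int n = 2;
    // "Difference too large" -> split again.
    while (std::fabs(a - b) > threshold && n < maxSamples) {
        a = b;
        b = drawSample();
        sum += b;
        ++n;
    }
    return { sum / n, n };
}
```

With a constant shading function the two initial samples already agree, so the loop stops at 2 samples; a noisy function keeps splitting until the threshold or budget is hit.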

Plans

I want to implement at least one adaptive sampling technique to show how it performs.

My plans for this project are not only about adaptive sampling. I will also make some changes to my path tracer to make it more interactive, if I have time. For example, I plan to change the way I allocate rendering work to threads. I will add a windowing system to be able to watch the sampling in real time (that was how I worked in the previous homeworks, since producing an image can take long and is harder to debug).

How It Went

I, as always, started this project by correcting the most recent version of my ray/path tracer. I always do this because, at submission time, my program is usually far from perfect.

Corrections and Additions

Screenshot Shortcut

I had always been taking screenshots or videos of my window -where the path tracing happens. For this project, I added a shortcut key to save the current image in PNG format.

Improving Estimator

I had an estimator during rendering to get an idea of how long the rendering would take. It always printed the total time, not the remaining time. While it was still useful, it is now updated so that it also prints the remaining time.


HDR (Tone Mapping)

That took a lot of my time -not that this made sense, you know. I had been looking forward to creating or using a local tone mapping algorithm since HW5. However, I did not have the time, so I postponed it until the project. I first did a little research on local tone mapping operators. After that, I realized there was no way I could implement a local tone mapper and also finish the project on time. But my tone mapping operator was known -by me; I mentioned this in previous homeworks, as I remember- to fail in some cases, and that was distracting, because one could have a tough time guessing whether a color difference came from the tone mapper or from something else...

Anyway, after I had given up on doing anything about local tone mappers, I decided to still pursue the tone mapping issue, which I have been taking more and more personally (joking here). Here is the Cornell box scene in path tracing at 32 SPP (samples per pixel) [left, of course :D]

The one on the right is the baseline at 22500 SPP. We are not really comparing apples and oranges here; the problem is the unnecessary overall color tone difference, which is easier to feel than to explain.

The code block that computes lw. I have always hated going against authority. The paper clearly says that we add a delta value (how to determine it is omitted either by the authors or by my eyes) inside ln(), and then divide exp(sum) by N. I know that whenever I take exp(sum), the value of sum is too big, so the result is infinity, or meaningless. Since I had no time to fool around, that was my little unbiased -ahem- solution back then.

I then tried many things to correct that algorithm, and finally wrote a working code block that seems to give acceptable results:

So I still do the stupid biased division; I just changed the "small" delta value to 1 and divide every element beforehand, to prevent huge values from spoiling the result.

My 32 SPP version after that:
It still seems off if you ask me, but I could not care at that moment; this was not a priority.

Next Event Estimation

This was one unfinished part of my last homework. Thinking it would be good to work on in my project, I attempted to finish the code for it (I had had to cut it and leave it there in the last homework). However unsuccessful I was, most probably only a little work remains for it to function correctly.

Applying next event estimation is straightforward when the light is one of the types from homeworks 1-5. However, I wanted to implement it on the Jaroslav Path Prism Light scene.

The initial version I prototyped was as follows: during scene parsing, the estimated center of every light mesh is calculated. Then, the NEE rays are sent toward the center of the light.

I chose this model: whether it is splitting, NEE, or any other method, I treat them all as equal samples. So I divide by the total number of samples (in the uniform sampling method, at least), not caring whether the NEE rays and GI (global illumination) rays hit the same light or not. At the implementation stage this made sense; however, I now realize that the NEE "samples" do not have the same characteristics as the GI samples, because we test whether the NEE ray hits the light or not. Ideally, we should not treat them as just another GI sample if they do not behave like GI samples. I guess NEE is a heuristic rather than a 100% unbiased hack. For example, if we have point lights and we use NEE to get GI... As the Turkish saying goes: I want my ayran not to spill, but I also want my path tracer not to be biased...

Enough thought on that. I do not think (joking) my result below deserves this much:


Well, my first thought (during the first 0.3 seconds of my neural pathways' thinking loops) was that it might be the desired result, and that it was just my path tracer's uniform sampling making the image quality look awful. Thank god I had a visual debugger:

Here you see the NEE ray.

Here are more.

I was suggest-baited by an LLM again. The auto-completer's suggested code was calculating the middle point of the mesh incorrectly. However, funny as it seems, I decided to sleep on it and did not fix it, because I knew that NEE was just one item in the pile of things my ray tracer lacks, and I feared implementing NEE and then never being able to use it in my adaptive sampler. What was coming was bound to arrive...

Sponza

I have never been able to render this scene fully, so I tried to find the reason. The visual debugging outputs (which were probably showing the GI rays) suggested something was off:

I left this here to try again if I had any time.

Restructuring

I restructured my code to represent light types as child classes of a parent light class, to make the code easier to modify.

Correct Uniform Sampling

I corrected my uniform sampling code. It was not uniform, as I explained in the previous blog post. Guess which one is uniform?
The one on the right is the result of uniform sampling. You can see that it actually looks worse.

Starting the Project

I planned to make the rendering flow more natural by dividing jobs into tasks. Some tasks are singular and designed to be executed on only one thread, without switching threads. Others (ray tracing, for example) are designed to be executed on multiple threads, with each thread taking a desired number of tasks and finishing them.

Threads

Event Handler

I have always had the ability to interact with the ray tracer during rendering. I did this by calling my event handler function after every row finished rendering. This was acceptable, but sometimes even a single row takes too long to finish. So I decided to make this event handling mechanism work in another thread, now sleeping between calls to the event handling function (so as not to consume an entire core, since response time is not too important).
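A minimal sketch of that polling thread, assuming a hypothetical `handleEvents` placeholder for the real XLib event pump (this is an illustration, not the actual code):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> running{true};
std::atomic<int> eventsHandled{0};

// Placeholder for the real XLib event pump.
void handleEvents() { ++eventsHandled; }

// Poll events, then sleep so the thread does not burn a full core;
// response time is not critical here, so a coarse sleep is fine.
void eventLoop() {
    do {
        handleEvents();
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    } while (running.load());
}
```

In the real program this function would be launched once with `std::thread(eventLoop)` and stopped by setting `running = false` before joining.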

View Updater

This periodically puts the current image in the buffer to the window.

Task Manager

This is the part that matters for adaptive sampling. This manager is responsible for deciding which pixels of the image need sampling. In the uniform sampling case, this agent simply "commands" that a fixed number of samples is needed for each pixel.

There is now a tasks[xres][yres] matrix. Each entry is a struct Task { int count = 0; bool assigned = false; }. The count is how many samples we need for the corresponding pixel. The meaning of assigned will be explained later.

In the old sampling case, the task manager just sets count = SPP and returns.

In my adaptive sampling case, the task manager first generates 2 initial sample tasks per pixel. (Generating a sample here means incrementing the count field of that pixel.) Then, until the minimum total number of samples (SPP in the given scene file * xres * yres) is reached, it decides which block of pixels to sample next. It decides by traversing the image block by block and calculating the sum of the entropies of the pixels in each block. The block with the maximum entropy is selected as the next block to sample.

This thread does this in a while loop until it has assigned enough tasks.
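The task matrix and one block-selection step might look roughly like this. `assignHottestBlock` and the grid layout are illustrative, not my actual code; per-pixel entropies are supplied by the caller.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Task { int count = 0; bool assigned = false; };

// Scan the image block by block, pick the block with the highest summed
// entropy, and add one sample task to each of its pixels.
// Returns the top-left pixel index of the chosen block.
std::pair<int,int> assignHottestBlock(
        std::vector<std::vector<Task>>& tasks,
        const std::vector<std::vector<double>>& entropy,
        int blockSize) {
    int xres = (int)tasks.size(), yres = (int)tasks[0].size();
    double best = -1.0; int bx = 0, by = 0;
    for (int x = 0; x + blockSize <= xres; x += blockSize)
        for (int y = 0; y + blockSize <= yres; y += blockSize) {
            double sum = 0.0;
            for (int i = 0; i < blockSize; ++i)
                for (int j = 0; j < blockSize; ++j)
                    sum += entropy[x + i][y + j];
            if (sum > best) { best = sum; bx = x; by = y; }
        }
    // One more sample task for every pixel of the winning block.
    for (int i = 0; i < blockSize; ++i)
        for (int j = 0; j < blockSize; ++j)
            ++tasks[bx + i][by + j].count;
    return {bx, by};
}
```

The task manager's while loop would call this repeatedly until the total assigned count reaches SPP * xres * yres.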

More on this thread later.

Sampler

Normally I did my sampling in for loops in main, using OpenMP to parallelize the loop. Now there is a sampler thread that handles a block of pixels.

Basically, a sampler thread finishes the tasks assigned to the pixels in a block. The block size here is a hyperparameter. A sampler thread samples the pixels and puts the results in each pixel's sample container. There is a global (or local; it does not matter if ownership is handed over anyway) vector<Sample /*aka vector3*/> sample_matrix[xres][yres] container. The sampler pushes the new samples in. After pushing the samples, the sampler also calculates the resulting color of each pixel it sampled and writes it to the image matrix and to the image_data buffer for XLib (the latter is for display purposes).

Each sampler increments a std::atomic<int> total_samples counter so that overall progress can be tracked.
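A sketch of one sampler pass for a single pixel, reduced to one channel for brevity; `tracePixel` is a stand-in for the real path-tracing call, not my actual function name:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

using Sample = double;  // one channel here; the real code uses a vector3

std::atomic<int> total_samples{0};

// Finish the pending sample tasks for one pixel: trace each new sample,
// store it in the pixel's container, refresh the pixel's running mean,
// and bump the shared progress counter.
void samplePixel(std::vector<Sample>& container, double& pixelValue,
                 int pending, Sample (*tracePixel)()) {
    for (int s = 0; s < pending; ++s) {
        container.push_back(tracePixel());
        ++total_samples;
    }
    // Resolve the pixel as the mean of everything gathered so far.
    double sum = 0.0;
    for (Sample v : container) sum += v;
    pixelValue = container.empty() ? 0.0 : sum / container.size();
}
```

A sampler thread would run this for every pixel of its block, then write the resolved value to the image matrix and the XLib buffer.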

The explanations here are correct for the initial versions of my sampler. Some parts changed, and I am going to explain how things went.

What I Learned in This Project (-or ISSUES, as a heading)

First of all, manual control of multithreading is not straightforward at all. I cannot even explain -nor do I remember- all the weird issues I had. I even got "stack smashing detected" errors!

Now, to keep following the explanation of the issue, you need to watch this.


OK, just focus on the middle of the ceiling in this good old Cornell Box Area scene, where the area light is closest. The original scene has 100 SPP. I reduced the SPP to 3: the minimum for adaptive sampling to make a difference (in my implementation, of course; theoretically it can make a difference with as few as 5 samples... No, I mean 5 samples for the whole image, not 5 SPP. But that does not mean I could write an algorithm for that. Anything beyond the pixel limit means a lot of work.)

If you are patient, you will see that the basic entropy algorithm actually detects the part of the ceiling where there is noise. You would probably guess it needs more sampling just by looking.

There is a problem: the algorithm keeps picking the same spot even after it has been sampled correctly. At first, the reason was the speed difference between the sampler and the task manager: the task manager was handing out the same task before the sampler/renderer was able to finish it. This could be solved with some crude sleeps, additional flags, or something like that. (The flag solution sounds dirty, to be honest.) I first tried making it sleep after deciding on a sample area. The problem still occurred. I was sure the sampler had finished the task, yet the task manager picked the same block.

The reason was that I was not averaging the entropies per pixel: more samples meant more total entropy, whereas I was interested in the average entropy, not the total. So I changed the code accordingly, and the same problem occurred. I even changed the entropy calculator, but had no luck.

I thought maybe the problem was that the main loop did not know where to sample next, while only the task manager knew where the sampler should go. So the program blindly loops over the whole image only to sample a single part.

The Evil: Multi-Threading 

I switched to a less matrix-like solution: tasks would be pushed into a container and popped from it. The tasks would include the pixel coordinates to be sampled. However, this was a really bad idea, because multiple threads modifying a single container means A LOT OF trouble. So I wrote and tested some unsuccessful code, only to return to the same problem.

From my experience, I learned that multithreading is not easy to get right. You can make everything atomic, but then say goodbye to the advantages of caching. So one needs either to allocate time to design a clear yet complex multithreading solution, or to come up with a solution where the cases in which any thread needs to write to a shared memory location are as rare as possible. In my case, this corresponds to spatial division of concurrent jobs. For example, it may sound cool that a single pixel's samples are calculated by different threads in a 2000 SPP scene. However, the results must then be summed, and the problem becomes too complex.

So let us consider the aforementioned spatial division rule: my design was hard to implement correctly, since the tasks are written to by both the sampler threads and the task manager. This requires a really careful plan. And the worst enemy in these situations is not the thread safety issues; it is what comes after them, if you did not put great effort into the design: unsolvable inefficiency. The constraints of thread safety convert normally straightforward efficiency problems into algorithmic labyrinths where you can get lost in loops until you die of dehydration.

Nature of Adaptive Sampling

No free lunch. Everything comes with a cost. Most of the time, there is a theoretical lower bound on the overhead of an optimization (heuristic or not). Adaptive sampling is not an exception: the cost of choosing the next sample (or sample block) is exactly this kind of overhead. However, through the development and the problems I faced implementing it, I realized something bigger and deeper: the nature of adaptive sampling.

Adaptive sampling is sequential by nature. When you increase the block size in my approach, you actually decrease quality, because the next sample might have given you information and let you place the remaining samples in the block better. But whenever you decrease the block size, you make the program more sequential. The blocking itself is already a bias -bias in terms of estimating the estimator- meaning that even if you tune the other hyperparameters, having block size as a parameter will never get you to the ideal estimator... So you are practically balancing a trade-off between parallelism and adaptivity. Or be a genius and find a way to make the sequential decision-making process as parallel as possible. There is probably a theoretical limit to that too.

I learned all this while trying to solve the problems of my adaptive sampler and realizing that some solutions that came to mind actually annihilate the parallelism, which was already the main source of the problems!

Conclusion and Solution

I could not finish this project with satisfying results. I did not have time. It was literally only in the last 2-3 minutes that I got a working adaptive sampler. Here is how: I removed the task manager entirely. Then I made it so that -apart from the initial samples- in a while loop, the entropy of each cell (pixel) is calculated. Then the mean of these entropies is taken, and every pixel whose entropy is larger than the mean is assigned to be sampled. This iterative process continues until the sample limit is reached.

I did not have time to test this on multiple scenes and compare it with the other sampling methods (on quality/time metrics). I would like to update this post if I am able to obtain meaningful results tomorrow.

Updates

Note: The text below represents the updates after January 23rd, 24:00.

Update 1: Visualization

I was able to visualize the adaptive sampling process. The comparative result will be available later. This video shows the sampled regions over time.

Update 2: Improvements and Comparative Results

The desired output from the Cornell Box Area scene: 100 SPP

I will explain the content in chronological order.

AS (adaptive sampling) v1, 6 SPP, 1 minute

When I ran my AS v1 algorithm, the result was as above, with 6 SPP in 1 minute. OK, we can see that the samples are not wasted here; that is good. However, 1 minute is too long, and I believed there were other modifications that could be made to the algorithm.

The key concept in my AS algorithm is the entropy calculation. It is sum(sample - mean) over each pixel's samples, where mean is the average of all samples in that pixel. I used to divide this by the square of the sample count; however, the degrees of freedom here are actually sample count - 1: if a pixel had only 1 sample, it would have no entropy, and 0/0 corresponds to that lack of information. I changed the code to divide by (n-1)^2 and ran the program again:
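A sketch of this per-pixel entropy, assuming the sum is over absolute deviations (a plain signed sum would cancel to zero), with the (n-1)^2 normalization:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Per-pixel "entropy" sketch: sum of absolute deviations from the
// pixel's sample mean, divided by (n-1)^2. A single sample carries no
// spread information, so it contributes zero entropy.
double pixelEntropy(const std::vector<double>& samples) {
    size_t n = samples.size();
    if (n < 2) return 0.0;  // the 0/0 case: no information yet
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= n;
    double dev = 0.0;
    for (double s : samples) dev += std::fabs(s - mean);
    return dev / ((n - 1.0) * (n - 1.0));
}
```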

AS v2, 6 SPP, 55 s
As you can see (you can click the images to view them sequentially), the noise is slightly reduced and the time is also shorter. The reason for the shorter time is not surprising: each adaptive sampling loop involves a non-deterministic number of new samples. The more samples per loop, the faster the program, since that usually means fewer iterations over the image. Normally this is a trade-off, but in this case it turned out better, since the positions that need more sampling are estimated more correctly.

Normally, each pixel whose entropy is higher than the mean is sampled once. I wondered what would happen if I added more samples the higher the pixel's entropy.

AS v3, 6 SPP, 12 s

The result is noisier. This is probably a trade-off. However, this detail will become less and less relevant as I explain the following updates to my sampling system. Also, there is an elephant in the room. You can try to guess what it is; I will explain it later, while giving hints.

Problems In Path Tracing

Cornell Box path tracing scene, AS v3, 6 SPP

This is the result of my sampling algorithm on a path tracing scene (Cornell Box). I could not find why the problem existed, and guessed that the path tracer was oversampling the light, causing the tone mapper to output an all-dark image. In reality, the problem was both that and something else.

Improving the Adaptive Sampling System

I decided to take care of the path tracer later, focusing instead on comparisons and improvements of my current sampling system.

The first thing I did was render the same scene with the same number of samples, with and without AS:
    Cornell Box Area, 6 SPP
    Left is regular sampling, which took 4 seconds; right is adaptive sampling, which took 24 seconds.


It not only looked worse, it also took longer. Some parts of the image are less noisy; however, you might also notice that other parts have sharp, noisy patches. Also, comparing equal rendering times, 24 seconds corresponds to 18 SPP with the naive method!

I was using an atomic float to accumulate the entropies calculated in parallel. I removed it and separated the calculation and summation parts, achieving a much better result: the same scene took 11 seconds. However, this still would not explain the fact that the adaptive sampling system actually performs worse than the non-adaptive one.

If you watch the video below, you can see that although the algorithm can estimate the initial positions where the entropy is higher, it gets stuck at those positions. You can observe this in the shadows. And that is why the results have lots of noisy pixels.

That is also the problem in the path tracing scene! In path tracing, the rays mostly miss the light and come back black. That is why a lot of samples are needed. My adaptive sampling system first takes 2 initial samples and then determines the next adaptive samples according to those. If the 2 samples happen to give equal colors, the initial entropy of that pixel is 0, and the pixel is never sampled again.

Another problem I realized was that the white portions of the Cornell Box scene were oversampled. I compared against the baseline image, and weirdly, the white oversampled regions were not different from the baseline at all, let alone noisy. Then I realized that the unclamped results were actually noisy, but they were all clamped to 255, so no visible noise was present.

So I changed the entropy calculating part of my system to check whether the current camera's output is to be tone mapped, and if not, to clamp every sample during the entropy calculation:

The results are (1) faster and (2) better. Here are the comparisons:

AS, 3 SPP, 2 seconds

default, 4 SPP, 2 seconds

***
AS, 6 SPP, 4 seconds
default, 7 SPP, 4 seconds
***
AS, 12 SPP, 10 seconds
default, 16 SPP, 10 seconds
***
AS, 24 SPP, 25 seconds
default, 32 SPP, 26 seconds

Notice that the complexity of the adaptive sampling system is more than O(n)...

A Mistake

The mistake (it can also be passed off as a heuristic) is that a single sample being above 255 does not mean it should be clamped when calculating the actual entropy. What should be done -in a non-tonemapped image- is actually this:
1. When the mean is less than 255, any sample bigger than 255 should be counted, because that sample affects the mean of the sample set.
2. When the mean is more than 255, the mean should first be clamped to 255. Then the samples bigger than 255 should be counted as 0, since we do not know whether such a sample affects the mean or not. This feels more intuitive once you realize that if a pixel has many samples that would not change its mean if it were sampled again, then that pixel is less likely to need sampling.
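The two cases can be sketched as follows (single-channel samples for brevity; this returns the raw deviation sum rather than the full normalized entropy):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Clamping heuristic for non-tonemapped images, following the two
// numbered cases above.
double clampedDeviation(const std::vector<double>& samples) {
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();
    double dev = 0.0;
    if (mean <= 255.0) {
        // Case 1: over-bright samples still moved the mean, so count
        // them at their full (unclamped) value.
        for (double s : samples) dev += std::fabs(s - mean);
    } else {
        // Case 2: clamp the mean to 255 and count samples above 255
        // as 0, since resampling there could not change what is seen.
        mean = 255.0;
        for (double s : samples)
            if (s <= 255.0) dev += std::fabs(s - mean);
    }
    return dev;
}
```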

This, of course, opens another topic for images that are to be tone mapped, since the hypothesis above assumes the image is observed directly rather than tone mapped before the entropy calculations. A fluctuation of millions of luminance units between 2 samples in a very bright region of an image can mean less than a 0.1 difference between 2 samples in another region. The best we can do is normalize the collected samples before applying statistical analysis (such as computing entropy) to them. And even at its best, normalization can only count the two cases equally.

I did not and could not address this problem, and the solution hypothesis above is not backed by any formal proof.

An Improvement

Let us come back to the unsolved "getting stuck" problem. The problem is actually twofold in nature. One side is local "stuckness": some pixels of a shadow region never get sampled even though nearby pixels are correctly found to have high entropy. I attempted to solve this by also generating samples for the 4 neighboring pixels. Over time, this should hopefully let the algorithm find each contiguous high-entropy region.
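The neighbor trick might look like this, with `pending` as a hypothetical per-pixel counter of queued samples (an illustrative structure, not my actual code):

```cpp
#include <cassert>
#include <vector>

// When a pixel is selected, also queue one sample for each of its
// 4-connected neighbours, so contiguous high-entropy regions are
// discovered even if some pixels in them had a deceptively low
// initial entropy.
void queueWithNeighbours(std::vector<std::vector<int>>& pending,
                         int x, int y) {
    int xres = (int)pending.size(), yres = (int)pending[0].size();
    const int dx[] = {0, 1, -1, 0, 0};
    const int dy[] = {0, 0, 0, 1, -1};
    for (int k = 0; k < 5; ++k) {
        int nx = x + dx[k], ny = y + dy[k];
        if (nx >= 0 && nx < xres && ny >= 0 && ny < yres)
            ++pending[nx][ny];  // skip neighbours outside the image
    }
}
```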

The other side of the problem is global "stuckness" (sorry for the bad naming; I really could not think of anything else). While the local side can be solved with the former approach, the global problem is harder. It can be addressed by adding random samples in each pass, etc. (which reminds me of the famous concept of "exploration vs. exploitation", where exploration corresponds to adding random samples and exploitation to adaptive samples).

The reason approaching global stuckness can be tricky is that the solutions need balance: there is a risk of defeating the purpose of adaptivity -at least that is how I feel. One solution that crosses my mind is sampling a few "cold" (= low entropy) pixels in small amounts each loop.

Here are the last solution's visuals:

First, the overlay view video showing the cumulative sampled regions: (I reset the view in the middle to see the new samples)

Now that we know adaptive sampling surpasses the naive solution, let me compare it with the previous adaptive sampling system below.

old, 3 SPP, 2s
new, 3 SPP, 2s

***
old, 6 SPP, 4s

new, 6 SPP, 4s
As you can see, there is a slight improvement.

With these comparisons and one last remark, I would like to conclude this big update: the elephant in the room was the O(n) memory complexity. It is not sustainable to keep every sample for every pixel. I leave that as future work.
