Bravia XR's whole neuron marketing is basically AI image detection. Oh, a human, I recognize that. Will the human watching the TV look at the human in the center of the big screen, and should I clean up around them for better focus? Color grading, focus, highlights, etc.
TV source material has no motion vectors, and TVs can live with high latencies that would never work for a game. There's a reason this processing is typically disabled in game mode.
Video processing can only leverage pixels because that's all it gets. Games get access to much more data from sampling the graphics pipeline. In graphics we can take accurate point samples with far more data available per sample: HDR value, exposure, subpixel offset, depth, etc.
Video processing uses optical flow, and it works because it has access to many more frames in the past and future, since it is not latency dependent; games don't have that. Games use geometry motion vectors, and they cannot afford to stack as many future/past frames as video processing does.
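To make the contrast concrete, here's a rough numpy sketch (nothing vendor-specific, names and sizes made up) of the difference: a video processor has to *estimate* motion by searching pixels, while a game engine is simply *handed* exact per-pixel motion vectors by the rasterizer.

```python
# Rough sketch: estimating motion from pixels the way a video processor must,
# versus a game engine that already knows the motion from geometry.
import numpy as np

def block_match_flow(prev, curr, x, y, block=8, search=4):
    """Estimate a motion vector for one block by brute-force search (SAD cost);
    this is all a video processor can do, since it only has pixels."""
    ref = curr[y:y+block, x:x+block]
    best, best_mv = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = prev[y+dy:y+dy+block, x+dx:x+dx+block]
            cost = np.abs(cand - ref).sum()        # sum of absolute differences
            if cost < best:
                best, best_mv = cost, (dx, dy)
    return best_mv

# A game engine skips the search entirely: the renderer already knows where each
# pixel's geometry was last frame, so the motion vector is an exact output of
# rendering, not an estimate.
rng = np.random.default_rng(0)
prev = rng.random((64, 64))
curr = np.roll(prev, shift=(0, 2), axis=(0, 1))    # fake 2-pixel horizontal pan
print(block_match_flow(prev, curr, 16, 16))        # (-2, 0): recovered by searching pixels
```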
So there's really not much to take from gaming upscaling vs. a TV processing unit, and after watching Cerny's presentation I honestly don't get where you want to go with this. What you can take away from Bravia XR is super resolution... which we get to below.
That's the very definition of super resolution upscaling; Cerny says as much and employs the same wording as the rest of the industry.
Then you fill in the gaps as Cerny says.
You can do it the dumb way and fill in the holes by copying/averaging the values from the closest pixels (nearest neighbor upscaling), but that will look like shit. Then you can say, oh well, why not a bilinear / bicubic upscaling algorithm, but that still won't look good enough and smears the image. Algorithms mostly stop there, on the thesis that you cannot recover missing data by further processing.
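For illustration, here's a minimal numpy sketch of those two "dumb" upscalers; no claim this matches any shipping implementation.

```python
import numpy as np

def upscale_nearest(img, factor):
    """Nearest neighbor: every output pixel just copies its closest source pixel (blocky)."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def upscale_bilinear(img, factor):
    """Bilinear: every output pixel is a weighted average of the 4 nearest
    source pixels. Smoother than nearest neighbor, but it smears detail."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, h * factor)
    xs = np.linspace(0, w - 1, w * factor)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

lowres = np.random.default_rng(1).random((4, 4))
print(upscale_nearest(lowres, 2).shape)   # (8, 8), blocky
print(upscale_bilinear(lowres, 2).shape)  # (8, 8), smooth but blurry
```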
That's where neural networks come in. You cram millions of high quality images into a convolutional neural network, which breaks them down into smaller and smaller feature maps (feature extraction), and from those learned features the final layers can fill in the missing output data with plausible micro detail.
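As a toy illustration of that structure (untrained, shapes only), here's a tiny PyTorch network in the spirit of the simple sub-pixel super-resolution models: convolutions extract features, then the last layers rearrange the predicted sub-pixel values into a higher-resolution image. Real networks are trained on those millions of high quality images; this just shows the plumbing, it is not any vendor's model.

```python
import torch
import torch.nn as nn

class TinySuperRes(nn.Module):
    def __init__(self, scale=2):
        super().__init__()
        self.features = nn.Sequential(               # feature extraction
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        # predict scale*scale sub-pixel values per input pixel, then
        # rearrange them into a higher-resolution image (pixel shuffle)
        self.to_highres = nn.Sequential(
            nn.Conv2d(32, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lowres):
        return self.to_highres(self.features(lowres))

net = TinySuperRes(scale=2)
lowres = torch.rand(1, 3, 540, 960)        # e.g. a 960x540 input
print(net(lowres).shape)                   # torch.Size([1, 3, 1080, 1920])
```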
PSSR, XeSS and DLSS do the same thing. Slightly different recipes and parameters, but it's the same recipe. None of them truly invented the fundamental tech here, not even Nvidia; the prior papers on all this were pretty much developed in universities. What Nvidia gets credit for is being the first to make it a use case for real-time game rendering. The training model is the secret recipe, not so much the super resolution part.
So that's the super resolution part
But games don't work like upscaling a .jpg, and none of the above solutions dumbly reconstruct every hole / every gap frame by frame. That's not efficient, that's slow.
The spatial-temporal part → Game Movement.
All of them use the TAA framework
All of the last decade's solutions share the same baseline: they use frames in a pipeline as frame N and N+1, sometimes N-1, and you'll rarely have more than 3 frames in the pipeline. That covers technologies like checkerboard rendering, temporal upsampling, PSSR, TAA, ATAA, DLSS, XeSS and FSR.
But that checkerboard example is a cute, neat way to resolve an image that would need frame N and frame N+1 to interleave almost perfectly, and of course that's not what happens in games. That's why it got much more sophisticated, and traditional spatial-temporal upsampling leverages heuristics to identify invalid samples from previous frames.
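A tiny numpy illustration of why the checkerboard case is the easy one: frame N renders half the pixels in a checker pattern, frame N+1 the other half, and the full image is just the two interleaved. It only resolves cleanly because nothing moved between the frames, which is exactly what the heuristics above have to deal with when things do move.

```python
import numpy as np

h, w = 4, 8
full = np.arange(h * w, dtype=float).reshape(h, w)    # pretend "ground truth" image

mask_n = (np.add.outer(np.arange(h), np.arange(w)) % 2) == 0   # checker pattern
frame_n   = np.where(mask_n,  full, np.nan)    # pixels rendered on frame N
frame_np1 = np.where(~mask_n, full, np.nan)    # pixels rendered on frame N+1

merged = np.where(mask_n, frame_n, frame_np1)  # interleave the two half-frames
print(np.array_equal(merged, full))            # True only because nothing moved between frames
```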
Now, what are they all trying to achieve here? A hack of supersampling, of course.
Supersampling is too expensive, so then we got MSAA, which limited the pixels with multiple samples to the edges of geometry; basically anything not on the edge of a triangle was ignored, so no improvement to transparency or internal texture detail.
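Bare-bones numpy illustration of what supersampling actually buys you: render at 2x2 the resolution and average each 2x2 block down to one output pixel (MSAA keeps those multiple samples only where triangle edges land).

```python
import numpy as np

def downsample_2x2(highres):
    """Average each 2x2 block of the supersampled render into one output pixel."""
    h, w = highres.shape
    return highres.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))   # spatial average

highres = np.random.default_rng(4).random((8, 8))   # pretend 2x2 supersampled render
print(downsample_2x2(highres).shape)                # (4, 4) antialiased output
```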
TAA converts the spatial averaging of supersampling into a temporal average.
The same is true for all modern upscalers. By using the TAA framework, frame-to-frame accumulation and motion vectors, you'll suddenly have a bunch of gaps filled naturally by the next frame, almost as neatly as checkerboard interlacing, but not all gaps. Still, what this means is that a very large portion of the holes just got filled frame-to-frame.
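Hand-wavy numpy sketch of that accumulation loop: reproject last frame's accumulated color with the motion vector, then blend a small amount of the new 1 spp sample into it. The blend factor and the whole-pixel motion are illustrative, not from any particular engine.

```python
import numpy as np

def taa_accumulate(history, motion_px, current, alpha=0.1):
    """history: accumulated color from frame N-1, current: new 1 spp frame N,
    motion_px: (dx, dy) screen-space motion in whole pixels, for simplicity."""
    dx, dy = motion_px
    reprojected = np.roll(history, shift=(dy, dx), axis=(0, 1))  # fetch history where the pixel was last frame
    return (1 - alpha) * reprojected + alpha * current           # exponential moving average

rng = np.random.default_rng(2)
history = rng.random((8, 8))
current = rng.random((8, 8))
blended = taa_accumulate(history, motion_px=(1, 0), current=current)
print(blended.shape)   # over many frames this converges toward a supersampled average
```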
So why the fuck AI?
Each frame in TAA renders at 1 sample per pixel, 1spp.
But TAA has history problems with fine geometry and complex lighting/denoising. They all do.
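One classic heuristic behind those failures, sketched in numpy: clamp the reprojected history to the min/max of the current frame's 3x3 neighborhood. A sub-pixel thin feature that missed this frame's sample grid gets clamped away, which is the flickering/disappearing thin geometry problem. Purely illustrative, not engine code.

```python
import numpy as np

def neighborhood_clamp(history, current):
    """Clamp each history pixel to the color range of the current frame's 3x3 neighborhood."""
    h, w = current.shape
    clamped = np.empty_like(history)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - 1), min(h, y + 2)
            x0, x1 = max(0, x - 1), min(w, x + 2)
            nb = current[y0:y1, x0:x1]                 # 3x3 neighborhood of frame N
            clamped[y, x] = np.clip(history[y, x], nb.min(), nb.max())
    return clamped

current = np.zeros((5, 5))          # thin bright wire missed this frame's sample grid
history = np.zeros((5, 5))
history[2, 2] = 1.0                 # ...but it was captured in the history
print(neighborhood_clamp(history, current)[2, 2])   # 0.0: the wire got clamped out
```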
So the mask above is, in the PSSR patent's wording, what they refer to as "bigger holes", which the AI CNN takes over to correct. Much like with super resolution, AI is better placed to fill in missing information than hand-written algorithms.
The above image is from Nvidia's ATAA. They implemented the mask to detect these failure points and then focus rendering on those areas, but it's still a heuristic solution. For this solution Nvidia knocked on Microsoft's door to get conservative rasterization implemented circa 2016~17.
The DLSS CNN started because of this project at Nvidia, which was initially just meant to repair photos.
Here at the DLSS 2 presentation they refer to it @ 0:45
So they realized they could fix the above issues with the CNN model from that dude's .jpg repair tool.
DLSS uses the same masks as ATAA along with conservative rasterization from DirectX (and later Vulkan). I don't think I can actually share the PDF of the DLSS programming guide, as it is marked confidential as of its 27 Jan 2025 update, but I'm sure you can find it with a nice Google search. Clause 3.6.4 of the DLSS programming guide explains well the continuation of the above solution: if you are rendering objects with thin geometric features, they tend to pop in and out of view due to the low resolution of the input buffer (full of gaps, and multiple frames means in/out) and the rasterizer missing parts of that geometry. These are holes, much like the above image from ATAA.
So, those missing features, how do you reconstruct them? They are missing. They have no motion vectors and no color; it's truly a gap. Without any further help, DLSS will incorrectly associate the previous frame of that object with the motion vector of the background, so rather than mending the holes in the object, those holes persist.
So to help DLSS reconstruct, to basically laser-focus the AI on the interesting regions (because again, they are not dumbly looking at every pixel and throwing AI at it; the TAA framework already did most of the grunt work), they use conservative rasterization (from the ATAA implementation), because it ensures that if a primitive touches a pixel even by a slight amount, it gets drawn, with a motion vector. Then DLSS can reconstruct that hole much better than the previous ATAA and heuristic solutions. Same as the explanation in Cerny's paper of the machine learning inference process filling a hole with a higher quality fill.
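Loose numpy sketch of that "laser focus" idea: build a mask of pixels whose reprojected history can't be trusted and only hand those pixels to the expensive reconstruction. The crude depth-mismatch test and the threshold here are my stand-ins, not DLSS's or PSSR's actual criteria.

```python
import numpy as np

def hole_mask(depth_curr, depth_prev_reprojected, threshold=0.05):
    """Mark pixels where the history sample can't be trusted (disocclusion / hole)."""
    return np.abs(depth_curr - depth_prev_reprojected) > threshold

rng = np.random.default_rng(3)
depth_curr = rng.random((8, 8))
depth_prev = depth_curr.copy()
depth_prev[2:4, 2:4] += 0.5             # pretend an object moved and exposed the background

mask = hole_mask(depth_curr, depth_prev)
print(mask.sum(), "of", mask.size, "pixels need the expensive fill")   # 4 of 64
```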
So CNN models do not look at every rendered pixel and reconstruct the neighbours across the full screen. I saw some of your posts from the past where you think PSSR is so smart because it uses AI only where it needs to... yeah, they all do that.
They use the AI CNN to fix the failures of native+TAA, like their previous attempt with ATAA. They saw that failures such as what gets masked from frame N to frame N+1 in TAA are better filled in by the photo-repair CNN model.
I also saw one of your claims from before:
"the final image will actually have at least 1/4 of the output resolution rendered natively (so 1080p's worth of pixels in a 4K output) rather than 100% predicted from a lower mipmap like DLSS/XeSS"
The resolution part is already detailed previously in my post, but the mipmap part is false as well: the DLSS CNN is NOT designed to enhance texture resolution; it's not supposed to turn low resolution textures into high resolution ones. The texture mip bias in the engine integration should be set so textures are sampled at the same resolution as native rendering. The DLSS CNN is trying to fix temporal failures like the above image. In the pipeline it sits after the input (geometry/shading at e.g. 1080p, with the mip bias applied at texture sampling) → DL upsampling (e.g. 4K) → post processes like tonemap, depth of field, motion blur, bloom, etc. This is detailed by
Edward Liu, NVIDIA GTC 2020.
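Quick worked example of the mip bias point: bias texture sampling so a 1080p render samples its textures about as sharply as a native 4K render would. The log2(render/output) formula is the commonly used one for temporal upscalers; check the actual programming guide for the exact recommended value.

```python
import math

def texture_mip_bias(render_width, output_width):
    """Negative bias = sample sharper mip levels than the render resolution alone would pick."""
    return math.log2(render_width / output_width)

print(texture_mip_bias(1920, 3840))   # -1.0: sample one mip level sharper for a 1080p -> 4K upscale
```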
To close
As of just a few weeks ago
"A CNN processes pixel information through local operations spatially around a small number of neighboring pixels and temporarily across multiple frames"
Which is pretty much the TL;DR of what I detailed above.
And all that just went out the window with the transformer upscaler, which I have no fucking clue how they do.
As per the video, they bring "reason" to image detection, using self-attention and longer-range patterns across a much larger pixel window.
So that seems to mean that rather than focusing on a tile like CNN models do, it can look at a much broader view and do pattern recognition, maybe not even having to point fingers with masks, but that's something I'm not sure about; it's too new. Something to learn in the future.
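For a feel of the difference, here's a shape-only PyTorch sketch (a generic attention layer, not the actual DLSS transformer): a conv only ever mixes a pixel with its small neighborhood, while self-attention lets every pixel token attend to every other token in a much larger window.

```python
import torch
import torch.nn as nn

pixels = torch.rand(1, 4096, 64)   # 1 image, a 64x64 window flattened to 4096 tokens, 64 features each

conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)             # local: mixes a 3x3 neighborhood only
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

local = conv(pixels.transpose(1, 2).reshape(1, 64, 64, 64))    # back to (N, C, H, W) for the conv
global_mix, weights = attn(pixels, pixels, pixels)             # every token can look at every other token

print(local.shape)        # torch.Size([1, 64, 64, 64])
print(global_mix.shape)   # torch.Size([1, 4096, 64])
print(weights.shape)      # torch.Size([1, 4096, 4096]) -- the "much larger pixel window"
```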
I'll probably ask a graphics engine programmer I know, who was behind big AAA studio hits with their in-house engines and worked a lot with Nvidia on implementations in the past, to explain to me like I'm five what the transformer model is doing, because I know he's already going fully through the SDK and implementation, along with talking to his other nerd colleagues.