
Image upload benchmark #238

Open
CrabExtra wants to merge 9 commits into master from image_upload_benchamark

Conversation

@CrabExtra
Contributor

Added a simple image uploading benchmark

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated
}

private:
void transitionImageLayout(
Contributor

I know it's a PITA to have to write barriers, but trust me, if we could have made a better simple transition/barrier function, we would have.

What you can do is create one default barrier, where your subresourceRange or image is fixed, and start from that, modifying it each time; you're going to have different dst/src access/stage mask combinations for different barriers each time.

  • Pro Tip: Enable synchronization validation by overriding getAPIFeaturesToEnable() and look for possible missing barriers and sync bugs reported by the Vulkan validation layers.
	virtual video::IAPIConnection::SFeatures getAPIFeaturesToEnable() override
	{
		auto retval = base_t::getAPIFeaturesToEnable();
		retval.validations = true;
		retval.synchronizationValidation = true;
		return retval;
	}

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated
Comment thread 73_ImageUploadBenchmark/main.cpp
Comment on lines +120 to +136
uint32_t mortonCompact1By1(uint32_t x)
{
	x &= 0x55555555u;
	x = (x ^ (x >> 1u)) & 0x33333333u;
	x = (x ^ (x >> 2u)) & 0x0f0f0f0fu;
	x = (x ^ (x >> 4u)) & 0x00ff00ffu;
	x = (x ^ (x >> 8u)) & 0x0000ffffu;
	return x;
}

uint32_t2 mortonDecode(uint32_t code)
{
	return uint32_t2(
		mortonCompact1By1(code),
		mortonCompact1By1(code >> 1u)
	);
}
Contributor

we already have a Morton code function in our HLSL library.
already mentioned here: https://discord.com/channels/593902898015109131/1450452175234011147/1471719738525614334

remove this and include "nbl/builtin/hlsl/morton.hlsl".
if the usage is not clear, search on Discord and ask

}


double runBenchmarkImageStaging(
Contributor

this was a failed attempt because we realized we cannot work with preinitialized images and optimal layout in host-visible memory, correct?

Contributor

let's remove the body of the function and replace it with a comment; that saves some LoC, and we can add some documentation here on why this approach failed.

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated

bool keepRunning() override { return m_frameIndex < VERIFICATION_LOOP_COUNT; }

void workLoopBody() override
Contributor

what is this workLoopBody for?
I expected it to be empty, with the benchmarks just running during app initialization and the app terminating afterwards

Contributor

If you need to verify your compute algorithms, just run them once after the main benchmarks?
I don't get the point of VERIFICATION_LOOP_COUNT

Contributor

but good attempt to verify your results. always good practice.

smart_refctd_ptr<ISemaphore> m_sem;
uint32_t m_frameIndex = 0;

void runAllBenchmarks()
Contributor

this function looks good

double memcpyGBps;
};

void generateTileCopyRegions(
Contributor

looks ok, I had already reviewed it

Comment thread 73_ImageUploadBenchmark/main.cpp Outdated
}
}

BenchResult runBenchmark(
Contributor

rename to runBenchmarkBufferToImageCopyCommand or something better that shows it's different than the compute one

double memcpyGBps = totalGB / totalMemcpyTime;
m_logger->log(" Memcpy speed: %.2f GB/s", ILogger::ELL_PERFORMANCE, memcpyGBps);

return { wallThroughputGBps, gpuThroughputGBps, memcpyGBps };
Contributor

this function looks good and sane too


static const uint32_t TILE_SIZE = 128u;

[numthreads(128, 1, 1)]
Contributor
@Erfan-Ahmadi Apr 24, 2026

  • try 2D workgroups.
  • make the workgroup size 128x4 (make sure it is reflected in your dispatch); 128*4=512 is a good workgroup size, it can fit exactly 3 workgroups on modern SMs
  • pixelPos will be just the global thread idx. we need to have individual offsets for each tile request
  • your tile size is PoT, so you can just use a bitshift by 7 for division (>> TILE_SIZE_LOG2); for modulo use &127u, i.e. &(TILE_SIZE-1)
  • tileIdx will be globalPos.xy >> TILE_SIZE_LOG2 (define TILE_SIZE_LOG2 as 7)
  • localPos will be globalPos.xy & 127
  • your start read location will be tileIdx*TILE_SIZE*TILE_SIZE, i.e. tileIdx << 14u
  • you get it :D I won't continue


[numthreads(128, 1, 1)]
[shader("compute")]
void MortonStore(uint32_t3 ID : SV_DispatchThreadID)
Contributor
@Erfan-Ahmadi Apr 24, 2026

  • make the workgroup 2D 16x16, or a 512-thread 1D workgroup
  • but it needs to handle a 16x16 region of the tile
  • your read pos stays the same with the flattened global idx (we need to make sure you're reading contiguously)
  • morton::code<false, 7, 2> mc; you now only need 4 bits for 16x16, so it becomes 4, 2 I think
  • use the bitshift and & for division and modulo, as in my previous comment.
    • it's very likely the compiler already does this optimization for you since TILE_SIZE is a macro, but it's good practice, in case it is later changed to a push constant or something not known at compile time
  • make sure this change is reflected in your dispatch:
    • since each tile is 128x128, it'll take 64 workgroups of 16x16 (or 32 workgroups of size 512) to handle the copy of a single tile for you
    • use the Morton code locally within this 16x16 block to figure out the write location, and add the block's offset.
    • doing Morton on 16x16 blocks with an added offset is no different than doing Morton globally on a 128x128 tile (see image below)

Morton example for a 16x16 group:
thread 0: reads (0,0) at location 0 and writes to pixelPos (0,0)
thread 1: reads (1,0) at location 1*ByteSize and writes to pixelPos (1,0)
thread 2: reads (0,1) at location 2*ByteSize and writes to pixelPos (0,1)
thread 3: reads (1,1) at location 3*ByteSize and writes to pixelPos (1,1)
thread 4: reads (2,0) at location 4*ByteSize and writes to pixelPos (2,0)
...
make sure this is what happens: reads are contiguous, writes are Morton-ordered.
it might actually be easier to achieve this with a single 1D 512 workgroup, not sure

Image

Contributor
@Erfan-Ahmadi Apr 24, 2026

btw we're not going to take the Morton benchmarking any further, but let's just fix it up and make it work as we first intended.

return throughputGBps;
}

BenchResult runBenchmarkCompute(
Contributor

you probably need to pass in workgroup size here as well
