Commit 52fead5
btrfs: introduce the device layout aware per-profile available space
[BUG]
There is a long known bug that if metadata is using RAID1 on two disks
with unbalanced sizes, there is a very high chance to hit ENOSPC related
transaction abort.
[CAUSE]
The root cause is in the available space estimation code:
- Factor based calculation
Just use all unallocated space, divide by the profile factor
One obvious user is can_overcommit().
This can not handle the following example:
devid 1 unallocated: 1GiB
devid 2 unallocated: 50GiB
metadata type: RAID1
If using factor based estimation, we can use (1GiB + 50GiB) / 2 = 25.5GiB
free space for metadata.
Thus we can continue allocating metadata (over-commit) way beyond the
1GiB limit.
But this estimation is completely wrong, in reality we can only allocate
one single 1GiB RAID1 block group, thus if we continue over-commit, at
one time we will hit ENOSPC at some critical path and flips the fs
read-only.
[SOLUTION]
This patch will introduce per-profile available space estimation,
which can provide chunk-allocator like behavior to give a (mostly)
accurate result, with under-estimate corner cases.
There are some differences between the estimation and real chunk
allocator:
- No consideration on hole size
It's fine for most cases, as all data/metadata strips are in 1GiB size
thus there should not be any hole wasting much space.
And chunk allocator is able to use smaller stripes when there is
really no other choice.
Although in theory this means it can lead to some over-estimation, it
should not cause too much hassle in the real world.
The other benefit of such behavior is, we avoid dev-extent tree search
completely, thus the overhead is very small.
- No true balance for certain cases
If we have 3 disks RAID1, and each device has 2GiB unallocated space,
we can load balance the chunk allocation so that we can allocate 3GiB
RAID1 chunks, and that's what chunk allocator will do.
But this current estimation code is using the largest available space
to do a single allocation. Meaning the estimation will be 2GiB, thus
under estimate.
Such under estimation is fine and after the first chunk allocation, the
estimation will be updated and still give a correct 2GiB
estimation.
So this only means the estimation will be a little conservative, which
is safer for call sites like metadata over-commit check.
With that facility, for above 1GiB + 50GiB case, it will give a RAID1
estimation of 1GiB, instead of the incorrect 25.5GiB.
Or for a more complex example:
devid 1 unallocated: 1T
devid 2 unallocated: 1T
devid 3 unallocated: 10T
We will get an array of:
RAID10: 2T
RAID1: 2T
RAID1C3: 1T
RAID1C4: 0 (not enough devices)
DUP: 6T
RAID0: 3T
SINGLE: 12T
RAID5: 2T
RAID6: 1T
[IMPLEMENTATION]
And for the each profile , we go chunk allocator level calculation:
The pseudo code looks like:
clear_virtual_used_space_of_all_rw_devices();
do {
/*
* The same as chunk allocator, despite used space,
* we also take virtual used space into consideration.
*/
sort_device_with_virtual_free_space();
/*
* Unlike chunk allocator, we don't need to bother hole/stripe
* size, so we use the smallest device to make sure we can
* allocated as many stripes as regular chunk allocator
*/
stripe_size = device_with_smallest_free->avail_space;
stripe_size = min(stripe_size, to_alloc / ndevs);
/*
* Allocate a virtual chunk, allocated virtual chunk will
* increase virtual used space, allow next iteration to
* properly emulate chunk allocator behavior.
*/
ret = alloc_virtual_chunk(stripe_size, &allocated_size);
if (ret == 0)
avail += allocated_size;
} while (ret == 0)
This minimal available space based calculation is not perfect, but the
important part is, the estimation is never exceeding the real available
space.
This patch just introduces the infrastructure, no hooks are executed
yet.
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>1 parent 08ef566 commit 52fead5
2 files changed
Lines changed: 198 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
392 | 392 | | |
393 | 393 | | |
394 | 394 | | |
| 395 | + | |
395 | 396 | | |
396 | 397 | | |
397 | 398 | | |
| |||
5387 | 5388 | | |
5388 | 5389 | | |
5389 | 5390 | | |
| 5391 | + | |
| 5392 | + | |
| 5393 | + | |
| 5394 | + | |
| 5395 | + | |
| 5396 | + | |
| 5397 | + | |
| 5398 | + | |
| 5399 | + | |
| 5400 | + | |
| 5401 | + | |
| 5402 | + | |
| 5403 | + | |
| 5404 | + | |
| 5405 | + | |
| 5406 | + | |
| 5407 | + | |
| 5408 | + | |
| 5409 | + | |
| 5410 | + | |
| 5411 | + | |
| 5412 | + | |
| 5413 | + | |
| 5414 | + | |
| 5415 | + | |
| 5416 | + | |
| 5417 | + | |
| 5418 | + | |
| 5419 | + | |
| 5420 | + | |
| 5421 | + | |
| 5422 | + | |
| 5423 | + | |
| 5424 | + | |
| 5425 | + | |
| 5426 | + | |
| 5427 | + | |
| 5428 | + | |
| 5429 | + | |
| 5430 | + | |
| 5431 | + | |
| 5432 | + | |
| 5433 | + | |
| 5434 | + | |
| 5435 | + | |
| 5436 | + | |
| 5437 | + | |
| 5438 | + | |
| 5439 | + | |
| 5440 | + | |
| 5441 | + | |
| 5442 | + | |
| 5443 | + | |
| 5444 | + | |
| 5445 | + | |
| 5446 | + | |
| 5447 | + | |
| 5448 | + | |
| 5449 | + | |
| 5450 | + | |
| 5451 | + | |
| 5452 | + | |
| 5453 | + | |
| 5454 | + | |
| 5455 | + | |
| 5456 | + | |
| 5457 | + | |
| 5458 | + | |
| 5459 | + | |
| 5460 | + | |
| 5461 | + | |
| 5462 | + | |
| 5463 | + | |
| 5464 | + | |
| 5465 | + | |
| 5466 | + | |
| 5467 | + | |
| 5468 | + | |
| 5469 | + | |
| 5470 | + | |
| 5471 | + | |
| 5472 | + | |
| 5473 | + | |
| 5474 | + | |
| 5475 | + | |
| 5476 | + | |
| 5477 | + | |
| 5478 | + | |
| 5479 | + | |
| 5480 | + | |
| 5481 | + | |
| 5482 | + | |
| 5483 | + | |
| 5484 | + | |
| 5485 | + | |
| 5486 | + | |
| 5487 | + | |
| 5488 | + | |
| 5489 | + | |
| 5490 | + | |
| 5491 | + | |
| 5492 | + | |
| 5493 | + | |
| 5494 | + | |
| 5495 | + | |
| 5496 | + | |
| 5497 | + | |
| 5498 | + | |
| 5499 | + | |
| 5500 | + | |
| 5501 | + | |
| 5502 | + | |
| 5503 | + | |
| 5504 | + | |
| 5505 | + | |
| 5506 | + | |
| 5507 | + | |
| 5508 | + | |
| 5509 | + | |
| 5510 | + | |
| 5511 | + | |
| 5512 | + | |
| 5513 | + | |
| 5514 | + | |
| 5515 | + | |
| 5516 | + | |
| 5517 | + | |
| 5518 | + | |
| 5519 | + | |
| 5520 | + | |
| 5521 | + | |
| 5522 | + | |
| 5523 | + | |
| 5524 | + | |
| 5525 | + | |
| 5526 | + | |
| 5527 | + | |
| 5528 | + | |
| 5529 | + | |
| 5530 | + | |
| 5531 | + | |
| 5532 | + | |
| 5533 | + | |
| 5534 | + | |
| 5535 | + | |
| 5536 | + | |
| 5537 | + | |
| 5538 | + | |
| 5539 | + | |
| 5540 | + | |
| 5541 | + | |
| 5542 | + | |
| 5543 | + | |
| 5544 | + | |
| 5545 | + | |
| 5546 | + | |
| 5547 | + | |
| 5548 | + | |
| 5549 | + | |
| 5550 | + | |
| 5551 | + | |
| 5552 | + | |
| 5553 | + | |
5390 | 5554 | | |
5391 | 5555 | | |
5392 | 5556 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
22 | 22 | | |
23 | 23 | | |
24 | 24 | | |
| 25 | + | |
25 | 26 | | |
26 | 27 | | |
27 | 28 | | |
| |||
213 | 214 | | |
214 | 215 | | |
215 | 216 | | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
216 | 223 | | |
217 | 224 | | |
218 | 225 | | |
| |||
458 | 465 | | |
459 | 466 | | |
460 | 467 | | |
| 468 | + | |
| 469 | + | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
461 | 477 | | |
462 | 478 | | |
463 | 479 | | |
| |||
887 | 903 | | |
888 | 904 | | |
889 | 905 | | |
| 906 | + | |
| 907 | + | |
| 908 | + | |
| 909 | + | |
| 910 | + | |
| 911 | + | |
| 912 | + | |
| 913 | + | |
| 914 | + | |
| 915 | + | |
| 916 | + | |
| 917 | + | |
| 918 | + | |
| 919 | + | |
| 920 | + | |
| 921 | + | |
| 922 | + | |
| 923 | + | |
890 | 924 | | |
891 | 925 | | |
892 | 926 | | |
| |||
0 commit comments