
Commit 52fead5

adam900710 authored and kdave committed
btrfs: introduce the device layout aware per-profile available space
[BUG]
There is a long known bug that if metadata is using RAID1 on two disks
with unbalanced sizes, there is a very high chance to hit an ENOSPC
related transaction abort.

[CAUSE]
The root cause is in the available space estimation code, which uses a
factor based calculation: take all unallocated space and divide it by
the profile factor. One obvious user is can_overcommit().

This can not handle the following example:

  devid 1 unallocated: 1GiB
  devid 2 unallocated: 50GiB
  metadata type:       RAID1

With factor based estimation, we get (1GiB + 50GiB) / 2 = 25.5GiB of
free space for metadata, and thus continue allocating metadata
(over-commit) way beyond the 1GiB limit.

But this estimation is completely wrong. In reality we can only allocate
a single 1GiB RAID1 block group, so if we keep over-committing, at some
point we will hit ENOSPC in a critical path and flip the fs read-only.

[SOLUTION]
This patch introduces per-profile available space estimation, which
provides chunk-allocator like behavior to give a (mostly) accurate
result, with under-estimating corner cases.

There are some differences between the estimation and the real chunk
allocator:

- No consideration of hole size
  This is fine for most cases, as all data/metadata stripes are 1GiB in
  size, so holes should not waste much space, and the chunk allocator is
  able to use smaller stripes when there is really no other choice.
  Although in theory this can lead to some over-estimation, it should
  not cause much hassle in the real world. The other benefit of this
  behavior is that we avoid the dev-extent tree search completely, so
  the overhead is very small.

- No true load balancing for certain cases
  If we have a 3 disk RAID1 setup where each device has 2GiB of
  unallocated space, the chunk allocator can balance the allocation so
  that 3GiB of RAID1 chunks fit, and that is what it will do. The
  estimation code instead uses the largest available space for a single
  allocation, so it reports 2GiB, an under-estimate. Such
  under-estimation is fine: after the first chunk allocation the
  estimation is updated and still gives a correct 2GiB. It only means
  the estimation is a little conservative, which is safer for call
  sites like the metadata over-commit check.

With this facility, the above 1GiB + 50GiB case gets a RAID1 estimation
of 1GiB, instead of the incorrect 25.5GiB.

Or for a more complex example:

  devid 1 unallocated: 1T
  devid 2 unallocated: 1T
  devid 3 unallocated: 10T

We will get an array of:

  RAID10:  2T
  RAID1:   2T
  RAID1C3: 1T
  RAID1C4: 0 (not enough devices)
  DUP:     6T
  RAID0:   3T
  SINGLE:  12T
  RAID5:   2T
  RAID6:   1T

[IMPLEMENTATION]
For each profile, we do a chunk allocator level calculation. The pseudo
code looks like:

  clear_virtual_used_space_of_all_rw_devices();
  do {
  	/*
  	 * The same as the chunk allocator: besides used space, we
  	 * also take virtual used space into consideration.
  	 */
  	sort_devices_by_virtual_free_space();

  	/*
  	 * Unlike the chunk allocator, we don't need to bother with
  	 * hole/stripe size, so we use the smallest device to make
  	 * sure we can allocate as many stripes as the regular chunk
  	 * allocator would.
  	 */
  	stripe_size = device_with_smallest_free->avail_space;
  	stripe_size = min(stripe_size, to_alloc / ndevs);

  	/*
  	 * Allocate a virtual chunk. An allocated virtual chunk
  	 * increases the virtual used space, allowing the next
  	 * iteration to properly emulate the chunk allocator behavior.
  	 */
  	ret = alloc_virtual_chunk(stripe_size, &allocated_size);
  	if (ret == 0)
  		avail += allocated_size;
  } while (ret == 0);

This minimal-available-space based calculation is not perfect, but the
important part is that the estimation never exceeds the real available
space.

This patch only introduces the infrastructure; no hooks are executed
yet.

Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Qu Wenruo <[email protected]>
Signed-off-by: David Sterba <[email protected]>
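The estimation loop above can be sketched in userspace. The following is a simplified model, not the kernel code: `struct raid_attr`, `alloc_virtual_chunk()` and `calc_one_profile_avail()` here are stand-ins whose parameter values are assumptions chosen to match the commit message's example table (in particular `devs_min` for RAID0 and RAID10), device space is tracked in whole GiB, and the chunk-mutex locking, dev-state checks and the 1MiB reserved range are omitted.

```c
#include <string.h>

typedef unsigned long long u64;

#define NDEVS_MAX 16

/* Simplified stand-ins for btrfs_raid_array entries (assumed values). */
struct raid_attr {
	const char *name;
	int devs_min;
	int devs_max;	/* 0 == no limit */
	int devs_increment;
	int ncopies;
	int nparity;
};

static const struct raid_attr raid_attrs[] = {
	{ "RAID10",  2, 0, 2, 2, 0 },
	{ "RAID1",   2, 2, 2, 2, 0 },
	{ "RAID1C3", 3, 3, 3, 3, 0 },
	{ "RAID1C4", 4, 4, 4, 4, 0 },
	{ "DUP",     1, 1, 1, 2, 0 },
	{ "RAID0",   2, 0, 1, 1, 0 },
	{ "SINGLE",  1, 1, 1, 1, 0 },
	{ "RAID5",   2, 0, 1, 1, 1 },
	{ "RAID6",   3, 0, 1, 1, 2 },
};

/*
 * One round of virtual chunk allocation: collect devices with free
 * space, sort them largest first, and let the smallest chosen device
 * dictate the stripe size. Returns 0 and the usable size in *allocated,
 * or -1 (the model's ENOSPC) when no further virtual chunk fits.
 */
static int alloc_virtual_chunk(u64 *free_gib, int ndevices,
			       const struct raid_attr *attr, u64 *allocated)
{
	int idx[NDEVS_MAX];
	int ndevs = 0;

	for (int i = 0; i < ndevices; i++)
		if (free_gib[i] > 0)
			idx[ndevs++] = i;

	/* Sort candidate devices by free space, largest first. */
	for (int i = 0; i < ndevs; i++)
		for (int j = i + 1; j < ndevs; j++)
			if (free_gib[idx[j]] > free_gib[idx[i]]) {
				int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
			}

	ndevs -= ndevs % attr->devs_increment;
	if (ndevs < attr->devs_min)
		return -1;
	if (attr->devs_max && ndevs > attr->devs_max)
		ndevs = attr->devs_max;

	/* Stripe size comes from the chosen device with the least space. */
	u64 stripe = free_gib[idx[ndevs - 1]];

	for (int i = 0; i < ndevs; i++)
		free_gib[idx[i]] -= stripe;	/* virtual used space */

	*allocated = stripe * (u64)(ndevs - attr->nparity) / attr->ncopies;
	return 0;
}

/* Repeat virtual allocations until ENOSPC, summing the usable space. */
static u64 calc_one_profile_avail(const u64 *unalloc_gib, int ndevices,
				  const struct raid_attr *attr)
{
	u64 free_gib[NDEVS_MAX], allocated, result = 0;

	memcpy(free_gib, unalloc_gib, ndevices * sizeof(*free_gib));
	while (alloc_virtual_chunk(free_gib, ndevices, attr, &allocated) == 0)
		result += allocated;
	return result;
}
```

Fed the 1T + 1T + 10T example (as 1024/1024/10240 GiB), this model reproduces the table above: 2T for RAID1, 0 for RAID1C4, 6T for DUP, 12T for SINGLE, and so on.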
Parent: 08ef566

2 files changed: 198 additions, 0 deletions

fs/btrfs/volumes.c: 164 additions, 0 deletions
@@ -392,6 +392,7 @@ static struct btrfs_fs_devices *alloc_fs_devices(const u8 *fsid)
 	INIT_LIST_HEAD(&fs_devs->alloc_list);
 	INIT_LIST_HEAD(&fs_devs->fs_list);
 	INIT_LIST_HEAD(&fs_devs->seed_list);
+	spin_lock_init(&fs_devs->per_profile_lock);
 
 	if (fsid) {
 		memcpy(fs_devs->fsid, fsid, BTRFS_FSID_SIZE);
@@ -5387,6 +5388,169 @@ static int btrfs_cmp_device_info(const void *a, const void *b)
 	return 0;
 }
 
+/*
+ * Return 0 if we allocated any virtual(*) chunk, and store its size in
+ * @allocated.
+ * Return -ENOSPC if we have no more space to allocate a virtual chunk.
+ *
+ * *: A virtual chunk is a chunk that only exists during the per-profile
+ *    available space estimation.
+ *    Those numbers won't really take on-disk space, but are only used to
+ *    emulate the chunk allocator behavior to get an accurate estimation
+ *    of the available space.
+ *
+ * Another difference is, a virtual chunk has no size limit and doesn't care
+ * about holes in the device tree, allowing us to exhaust device space
+ * much faster.
+ */
+static int alloc_virtual_chunk(struct btrfs_fs_info *fs_info,
+			       struct btrfs_device_info *devices_info,
+			       enum btrfs_raid_types type,
+			       u64 *allocated)
+{
+	const struct btrfs_raid_attr *raid_attr = &btrfs_raid_array[type];
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 stripe_size;
+	int ndevs = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+
+	/* Go through devices to collect their unallocated space. */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
+		u64 avail;
+
+		if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA,
+			      &device->dev_state) ||
+		    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state))
+			continue;
+
+		if (device->total_bytes > device->bytes_used +
+					  device->per_profile_allocated)
+			avail = device->total_bytes - device->bytes_used -
+				device->per_profile_allocated;
+		else
+			avail = 0;
+
+		avail = round_down(avail, fs_info->sectorsize);
+
+		/* And exclude the [0, 1M) reserved space. */
+		if (avail > BTRFS_DEVICE_RANGE_RESERVED)
+			avail -= BTRFS_DEVICE_RANGE_RESERVED;
+		else
+			avail = 0;
+
+		/*
+		 * Not enough to support a single stripe, this device
+		 * can not be utilized for chunk allocation.
+		 */
+		if (avail < BTRFS_STRIPE_LEN)
+			continue;
+
+		/*
+		 * Unlike the chunk allocator, we don't care about stripe or
+		 * hole size, so here we use @avail directly.
+		 */
+		devices_info[ndevs].dev_offset = 0;
+		devices_info[ndevs].total_avail = avail;
+		devices_info[ndevs].max_avail = avail;
+		devices_info[ndevs].dev = device;
+		++ndevs;
+	}
+	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+	     btrfs_cmp_device_info, NULL);
+	ndevs = rounddown(ndevs, raid_attr->devs_increment);
+	if (ndevs < raid_attr->devs_min)
+		return -ENOSPC;
+	if (raid_attr->devs_max)
+		ndevs = min(ndevs, (int)raid_attr->devs_max);
+	else
+		ndevs = min(ndevs, (int)BTRFS_MAX_DEVS(fs_info));
+
+	/*
+	 * Stripe size will be determined by the device with the least
+	 * unallocated space.
+	 */
+	stripe_size = devices_info[ndevs - 1].total_avail;
+
+	for (int i = 0; i < ndevs; i++)
+		devices_info[i].dev->per_profile_allocated += stripe_size;
+	*allocated = div_u64(stripe_size * (ndevs - raid_attr->nparity),
+			     raid_attr->ncopies);
+	return 0;
+}
+
+static int calc_one_profile_avail(struct btrfs_fs_info *fs_info,
+				  enum btrfs_raid_types type,
+				  u64 *result_ret)
+{
+	struct btrfs_device_info *devices_info = NULL;
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 allocated;
+	u64 result = 0;
+	int ret = 0;
+
+	lockdep_assert_held(&fs_info->chunk_mutex);
+	ASSERT(type >= 0 && type < BTRFS_NR_RAID_TYPES);
+
+	/* Not enough devices, quick exit, just update the result. */
+	if (fs_devices->rw_devices < btrfs_raid_array[type].devs_min) {
+		ret = -ENOSPC;
+		goto out;
+	}
+
+	devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info),
+			       GFP_NOFS);
+	if (!devices_info) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	/* Clear virtual chunk used space for each device. */
+	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)
+		device->per_profile_allocated = 0;
+
+	while (!alloc_virtual_chunk(fs_info, devices_info, type, &allocated))
+		result += allocated;
+
+out:
+	kfree(devices_info);
+	if (ret < 0 && ret != -ENOSPC)
+		return ret;
+	*result_ret = result;
+	return 0;
+}
+
+/* Update the per-profile available space array. */
+void btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info)
+{
+	u64 results[BTRFS_NR_RAID_TYPES];
+	int ret;
+
+	/*
+	 * Zoned is more complex as we can not simply get the amount of
+	 * available space for each device.
+	 */
+	if (btrfs_is_zoned(fs_info))
+		goto error;
+
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		ret = calc_one_profile_avail(fs_info, i, &results[i]);
+		if (ret < 0)
+			goto error;
+	}
+
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+		fs_info->fs_devices->per_profile_avail[i] = results[i];
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
+	return;
+error:
+	spin_lock(&fs_info->fs_devices->per_profile_lock);
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++)
+		fs_info->fs_devices->per_profile_avail[i] = U64_MAX;
+	spin_unlock(&fs_info->fs_devices->per_profile_lock);
+}
+
 static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 {
 	if (!(type & BTRFS_BLOCK_GROUP_RAID56_MASK))
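Applied to the 1GiB + 50GiB RAID1 case from the commit message, the virtual chunk loop in alloc_virtual_chunk()/calc_one_profile_avail() behaves very differently from the old factor based math. A minimal sketch, in MiB and for exactly two devices (the helper names are hypothetical, not kernel functions):

```c
/* Old factor based estimate: all unallocated space / ncopies (2 for RAID1). */
static unsigned long long factor_estimate(unsigned long long a_mib,
					  unsigned long long b_mib)
{
	return (a_mib + b_mib) / 2;
}

/*
 * Chunk-allocator-like estimate for two-device RAID1: each virtual chunk
 * is limited by the smaller device, and allocation stops as soon as
 * fewer than two devices still have free space.
 */
static unsigned long long raid1_estimate(unsigned long long a_mib,
					 unsigned long long b_mib)
{
	unsigned long long avail = 0;

	while (a_mib > 0 && b_mib > 0) {
		unsigned long long stripe = a_mib < b_mib ? a_mib : b_mib;

		a_mib -= stripe;
		b_mib -= stripe;
		avail += stripe;	/* 2 stripes, 2 copies -> stripe usable */
	}
	return avail;
}
```

For 1024 MiB + 51200 MiB, factor_estimate() reports 26112 MiB (25.5GiB) while raid1_estimate() reports 1024 MiB (1GiB), the value the real chunk allocator can actually deliver.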

fs/btrfs/volumes.h: 34 additions, 0 deletions
@@ -22,6 +22,7 @@
 #include <uapi/linux/btrfs_tree.h>
 #include "messages.h"
 #include "extent-io-tree.h"
+#include "fs.h"
 
 struct block_device;
 struct bdev_handle;
@@ -213,6 +214,12 @@ struct btrfs_device {
 
 	/* Bandwidth limit for scrub, in bytes */
 	u64 scrub_speed_max;
+
+	/*
+	 * A temporary amount of allocated space during the per-profile
+	 * available space calculation.
+	 */
+	u64 per_profile_allocated;
 };
 
 /*
@@ -458,6 +465,15 @@ struct btrfs_fs_devices {
 	/* Device to be used for reading in case of RAID1. */
 	u64 read_devid;
 #endif
+
+	/*
+	 * Each value indicates the available space for that profile.
+	 * U64_MAX means the estimation is unavailable.
+	 *
+	 * Protected by per_profile_lock.
+	 */
+	u64 per_profile_avail[BTRFS_NR_RAID_TYPES];
+	spinlock_t per_profile_lock;
 };
 
 #define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info)	\
@@ -887,6 +903,24 @@ int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
 bool btrfs_verify_dev_items(const struct btrfs_fs_info *fs_info);
+void btrfs_update_per_profile_avail(struct btrfs_fs_info *fs_info);
+
+static inline bool btrfs_get_per_profile_avail(struct btrfs_fs_info *fs_info,
+					       u64 profile, u64 *avail_ret)
+{
+	enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(profile);
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	bool uptodate = false;
+
+	spin_lock(&fs_devices->per_profile_lock);
+	if (fs_devices->per_profile_avail[index] != U64_MAX) {
+		uptodate = true;
+		*avail_ret = fs_devices->per_profile_avail[index];
+	}
+	spin_unlock(&fs_devices->per_profile_lock);
+	return uptodate;
+}
+
 bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
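The reader side added to volumes.h is a small lock-protected lookup with U64_MAX as the "no valid estimation" sentinel (set for zoned filesystems or after a failed update). A userspace model of that pattern, with a pthread mutex standing in for the spinlock and an assumed fixed profile count:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_RAID_TYPES 9	/* assumed stand-in for BTRFS_NR_RAID_TYPES */

/* Userspace model of the fields added to struct btrfs_fs_devices. */
struct fs_devices_model {
	pthread_mutex_t per_profile_lock;
	uint64_t per_profile_avail[NR_RAID_TYPES];
};

/*
 * Mirrors the shape of btrfs_get_per_profile_avail(): read the array
 * under the lock; UINT64_MAX marks an estimation that is unavailable,
 * so the caller knows to fall back to another heuristic.
 */
static bool get_per_profile_avail(struct fs_devices_model *fd, int index,
				  uint64_t *avail_ret)
{
	bool uptodate = false;

	pthread_mutex_lock(&fd->per_profile_lock);
	if (fd->per_profile_avail[index] != UINT64_MAX) {
		uptodate = true;
		*avail_ret = fd->per_profile_avail[index];
	}
	pthread_mutex_unlock(&fd->per_profile_lock);
	return uptodate;
}
```

Returning a boolean rather than the raw value keeps the sentinel check in one place: call sites (none are hooked up by this patch yet) can simply fall back to the factor based estimate when the function returns false.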
