Commit 9228990

Merge pull request #253407 from mulander/schema-based-sharding
Schema based sharding
2 parents 9e794c7 + e0ddb0c commit 9228990

21 files changed

Lines changed: 665 additions & 233 deletions

articles/cosmos-db/postgresql/TOC.yml

Lines changed: 5 additions & 0 deletions
```diff
@@ -62,6 +62,9 @@
 - name: Design a real-time dashboard
   href: tutorial-design-database-realtime.md
   displayName: tutorial, real-time
+- name: Design for microservices
+  href: tutorial-design-database-microservices.md
+  displayName: tutorial, microservices
 - name: Administer
   items:
   - name: Set up private access
@@ -71,6 +74,8 @@
   items:
   - name: Clusters
     href: concepts-cluster.md
+  - name: Sharding models
+    href: concepts-sharding-models.md
  - name: Distributed data
    items:
    - name: Nodes and tables
```

articles/cosmos-db/postgresql/concepts-colocation.md

Lines changed: 4 additions & 4 deletions
```diff
@@ -7,7 +7,7 @@ ms.service: cosmos-db
 ms.subservice: postgresql
 ms.custom: ignite-2022
 ms.topic: conceptual
-ms.date: 05/06/2019
+ms.date: 10/01/2023
 ---
 
 # Table colocation in Azure Cosmos DB for PostgreSQL
@@ -18,13 +18,13 @@ Colocation means storing related information together on the same nodes. Queries
 
 ## Data colocation for hash-distributed tables
 
-In Azure Cosmos DB for PostgreSQL, a row is stored in a shard if the hash of the value in the distribution column falls within the shard's hash range. Shards with the same hash range are always placed on the same node. Rows with equal distribution column values are always on the same node across tables.
+In Azure Cosmos DB for PostgreSQL, a row is stored in a shard if the hash of the value in the distribution column falls within the shard's hash range. Shards with the same hash range are always placed on the same node. Rows with equal distribution column values are always on the same node across tables. The concept of hash-distributed tables is also known as [row-based sharding](concepts-sharding-models.md#row-based-sharding). In [schema-based sharding](concepts-sharding-models.md#schema-based-sharding), tables within a distributed schema are always colocated.
 
 :::image type="content" source="media/concepts-colocation/colocation-shards.png" alt-text="Diagram shows shards with the same hash range placed on the same node for events shards and page shards." border="false":::
 
 ## A practical example of colocation
 
-Consider the following tables that might be part of a multi-tenant web
+Consider the following tables that might be part of a multitenant web
 analytics SaaS:
 
 ```sql
@@ -153,4 +153,4 @@ In some cases, queries and table schemas must be changed to include the tenant I
 
 ## Next steps
 
-- See how tenant data is colocated in the [multi-tenant tutorial](tutorial-design-database-multi-tenant.md).
+- See how tenant data is colocated in the [multitenant tutorial](tutorial-design-database-multi-tenant.md).
```
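The colocation rule this patch documents (rows with equal distribution column values land on the same node across tables) can be sketched in SQL. This is an illustrative sketch, not part of the patch: the `events` and `page_views` tables are hypothetical, and it assumes a provisioned cluster.

```sql
-- Colocate two hash-distributed tables on a shared tenant_id column,
-- so each tenant's rows from both tables live on the same node.
SELECT create_distributed_table('events', 'tenant_id');
SELECT create_distributed_table('page_views', 'tenant_id',
                                colocate_with => 'events');
```

With this setup, joins between `events` and `page_views` on `tenant_id` can execute locally on each worker, without cross-node data movement.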

articles/cosmos-db/postgresql/concepts-distributed-data.md

Lines changed: 0 additions & 123 deletions
This file was deleted.

articles/cosmos-db/postgresql/concepts-nodes.md

Lines changed: 17 additions & 10 deletions
```diff
@@ -6,7 +6,7 @@ author: jonels-msft
 ms.service: cosmos-db
 ms.subservice: postgresql
 ms.topic: conceptual
-ms.date: 10/26/2022
+ms.date: 09/29/2023
 ---
 
 # Nodes and tables in Azure Cosmos DB for PostgreSQL
@@ -25,23 +25,22 @@ allows the database to scale by adding more nodes to the cluster.
 
 Every cluster has a coordinator node and multiple workers. Applications
 send their queries to the coordinator node, which relays it to the relevant
-workers and accumulates their results. Applications are not able to connect
-directly to workers.
+workers and accumulates their results.
 
-Azure Cosmos DB for PostgreSQL allows the database administrator to *distribute* tables,
-storing different rows on different worker nodes. Distributed tables are the
-key to Azure Cosmos DB for PostgreSQL performance. Failing to distribute tables leaves them entirely
-on the coordinator node and cannot take advantage of cross-machine parallelism.
+Azure Cosmos DB for PostgreSQL allows the database administrator to *distribute* tables and/or schemas,
+storing different rows on different worker nodes. Distributed tables and/or schemas are the
+key to Azure Cosmos DB for PostgreSQL performance. Failing to distribute tables and/or schemas leaves them entirely
+on the coordinator node and can't take advantage of cross-machine parallelism.
 
 For each query on distributed tables, the coordinator either routes it to a
 single worker node, or parallelizes it across several depending on whether the
-required data lives on a single node or multiple. The coordinator decides what
+required data lives on a single node or multiple. With [schema-based sharding](concepts-sharding-models.md#schema-based-sharding), the coordinator routes the queries directly to the node that hosts the schema. In both schema-based sharding and [row-based sharding](concepts-sharding-models.md#row-based-sharding), the coordinator decides what
 to do by consulting metadata tables. These tables track the DNS names and
 health of worker nodes, and the distribution of data across nodes.
 
 ## Table types
 
-There are three types of tables in a cluster, each
+There are five types of tables in a cluster, each
 stored differently on nodes and used for different purposes.
 
 ### Type 1: Distributed tables
@@ -77,7 +76,15 @@ values like order statuses or product categories.
 
 When you use Azure Cosmos DB for PostgreSQL, the coordinator node you connect to is a regular PostgreSQL database. You can create ordinary tables on the coordinator and choose not to shard them.
 
-A good candidate for local tables would be small administrative tables that don't participate in join queries. An example is a users table for application sign-in and authentication.
+A good candidate for local tables would be small administrative tables that don't participate in join queries. An example is a `users` table for application sign-in and authentication.
+
+### Type 4: Local managed tables
+
+Azure Cosmos DB for PostgreSQL might automatically add local tables to metadata if a foreign key reference exists between a local table and a reference table. Additionally, local managed tables can be created manually by running the [citus_add_local_table_to_metadata](reference-functions.md#citus_add_local_table_to_metadata) function on regular local tables. Tables present in metadata are considered managed tables and can be queried from any node; Citus knows to route to the coordinator to obtain data from the local managed table. Such tables are displayed as local in the [citus_tables](reference-metadata.md#distributed-tables-view) view.
+
+### Type 5: Schema tables
+
+With [schema-based sharding](concepts-sharding-models.md#schema-based-sharding) introduced in Citus 12.0, distributed schemas are automatically associated with individual colocation groups. Tables created in those schemas are automatically converted to colocated distributed tables without a shard key. Such tables are considered schema tables and are displayed as schema in the [citus_tables](reference-metadata.md#distributed-tables-view) view.
 
 ## Shards
```
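The "local managed table" flow added in the new Type 4 section can be sketched as follows. The `plans` table is a hypothetical example, not from the patch, and the sketch assumes a provisioned cluster.

```sql
-- Create a regular local table on the coordinator.
CREATE TABLE plans (id int PRIMARY KEY, name text);

-- Manually register it in Citus metadata so it becomes a
-- local managed table, queryable from any node.
SELECT citus_add_local_table_to_metadata('plans');

-- The table is now listed in the citus_tables view with type 'local'.
SELECT table_name, citus_table_type
FROM citus_tables
WHERE table_name = 'plans'::regclass;
```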

Lines changed: 68 additions & 0 deletions
```diff
@@ -0,0 +1,68 @@
+---
+title: Sharding models - Azure Cosmos DB for PostgreSQL
+description: What is sharding, and what sharding models are available in Azure Cosmos DB for PostgreSQL
+ms.author: adamwolk
+author: mulander
+ms.service: cosmos-db
+ms.subservice: postgresql
+ms.topic: conceptual
+ms.date: 09/08/2023
+---
+
+# Sharding models
+
+[!INCLUDE [PostgreSQL](../includes/appliesto-postgresql.md)]
+
+Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It involves breaking up a large database or dataset into smaller, more manageable parts called shards. A shard contains a subset of the data, and together the shards form the complete dataset.
+
+Azure Cosmos DB for PostgreSQL offers two types of data sharding: row-based and schema-based. Each option comes with its own [sharding tradeoffs](#sharding-tradeoffs), allowing you to choose the approach that best aligns with your application's requirements.
+
+## Row-based sharding
+
+The traditional way in which Azure Cosmos DB for PostgreSQL shards tables is the single database, shared schema model, also known as row-based sharding: tenants coexist as rows within the same table. The tenant is determined by defining a [distribution column](./concepts-nodes.md#distribution-column), which allows splitting up a table horizontally.
+
+Row-based sharding is the most hardware-efficient way of sharding. Tenants are densely packed and distributed among the nodes in the cluster. This approach, however, requires making sure that all tables in the schema have the distribution column and that all queries in the application filter by it. Row-based sharding shines in IoT workloads and for getting the most out of your hardware.
+
+Benefits:
+
+* Best performance
+* Best tenant density per node
+
+Drawbacks:
+
+* Requires schema modifications
+* Requires application query modifications
+* All tenants must share the same schema
+
```
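The row-based model described above can be sketched in SQL. Table and column names are illustrative, not from the patch; the sketch assumes a provisioned cluster.

```sql
-- Every table carries the tenant ID and is distributed by it.
-- The distribution column must be part of the primary key.
CREATE TABLE orders (
  tenant_id bigint NOT NULL,
  order_id  bigserial,
  total     numeric,
  PRIMARY KEY (tenant_id, order_id)
);
SELECT create_distributed_table('orders', 'tenant_id');

-- Application queries must filter by the distribution column so the
-- coordinator can route them to a single worker node.
SELECT count(*) FROM orders WHERE tenant_id = 42;
```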
```diff
+## Schema-based sharding
+
+Available with Citus 12.0 in Azure Cosmos DB for PostgreSQL, schema-based sharding is the shared database, separate schema model: the schema becomes the logical shard within the database. Multitenant apps can use a schema per tenant to easily shard along the tenant dimension. Query changes aren't required, and the application only needs a small modification to set the proper search_path when switching tenants. Schema-based sharding is an ideal solution for microservices, and for ISVs deploying applications that can't undergo the changes required to onboard row-based sharding.
+
+Benefits:
+
+* Tenants can have heterogeneous schemas
+* No schema modifications required
+* No application query modifications required
+* Better SQL compatibility than row-based sharding
+
+Drawbacks:
+
+* Fewer tenants per node compared to row-based sharding
+
+## Sharding tradeoffs
+
+|| Schema-based sharding | Row-based sharding|
+|---|---|---|
+|Multi-tenancy model|Separate schema per tenant|Shared tables with tenant ID columns|
+|Citus version|12.0+|All versions|
+|Extra steps compared to vanilla PostgreSQL|None, only a config change|Use create_distributed_table on each table to distribute & colocate tables by tenant ID|
+|Number of tenants|1-10k|1-1M+|
+|Data modeling requirement|No foreign keys across distributed schemas|Need to include a tenant ID column (a distribution column, also known as a sharding key) in each table, and in primary keys and foreign keys|
+|SQL requirement for single node queries|Use a single distributed schema per query|Joins and WHERE clauses should include the tenant_id column|
+|Parallel cross-tenant queries|No|Yes|
+|Custom table definitions per tenant|Yes|No|
+|Access control|Schema permissions|Schema permissions|
+|Data sharing across tenants|Yes, using reference tables (in a separate schema)|Yes, using reference tables|
+|Tenant to shard isolation|Every tenant has its own shard group by definition|Can give specific tenant IDs their own shard group via isolate_tenant_to_new_shard|
```
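The "config change" row in the table above refers to enabling schema-based sharding and then creating a schema per tenant. A minimal sketch, with illustrative schema and table names (Citus 12.0+, provisioned cluster assumed):

```sql
-- Enable schema-based sharding; schemas created afterwards become
-- distributed schemas (logical shards).
SET citus.enable_schema_based_sharding TO on;

CREATE SCHEMA tenant_a;
CREATE TABLE tenant_a.orders (id bigserial PRIMARY KEY, total numeric);

-- Existing schemas can also be distributed explicitly:
-- SELECT citus_schema_distribute('tenant_b');

-- The application switches tenants by setting the search path;
-- queries themselves need no tenant filter.
SET search_path TO tenant_a;
SELECT count(*) FROM orders;
```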

articles/cosmos-db/postgresql/concepts-upgrade.md

Lines changed: 8 additions & 4 deletions
```diff
@@ -6,7 +6,7 @@ author: jonels-msft
 ms.service: cosmos-db
 ms.subservice: postgresql
 ms.topic: conceptual
-ms.date: 05/16/2023
+ms.date: 10/01/2023
 ---
 
 # Cluster upgrades in Azure Cosmos DB for PostgreSQL
@@ -16,7 +16,7 @@ ms.date: 05/16/2023
 The Azure Cosmos DB for PostgreSQL managed service can handle upgrades of both the
 PostgreSQL server, and the Citus extension. All clusters are created with [the latest Citus version](./reference-extensions.md#citus-extension) available for the major PostgreSQL version you select during cluster provisioning. When you select a PostgreSQL version such as PostgreSQL 15 for in-place cluster upgrade, the latest Citus version supported for selected PostgreSQL version is going to be installed.
 
-If you need to upgrade the Citus version only, you can do so by using an in-place upgrade. For instance, you may want to upgrade Citus 11.0 to Citus 11.3 on your PostgreSQL 14 cluster without upgrading Postgres version.
+If you need to upgrade the Citus version only, you can do so by using an in-place upgrade. For instance, you might want to upgrade Citus 11.0 to Citus 11.3 on your PostgreSQL 14 cluster without upgrading Postgres version.
 
 ## Upgrade precautions
 
@@ -30,10 +30,14 @@ Also, upgrading a major version of Citus can introduce changes in behavior.
 It's best to familiarize yourself with new product features and changes to
 avoid surprises.
 
+Noteworthy Citus 12 changes:
+* The default rebalance strategy changed from `by_shard_count` to `by_disk_size`.
+* Support for PostgreSQL 13 has been dropped as of this version.
+
 Noteworthy Citus 11 changes:
 
-* Table shards may disappear in your SQL client. Their visibility
-  is now controlled by
+* Table shards might disappear in your SQL client. You can control their visibility
+  using
   [citus.show_shards_for_app_name_prefixes](reference-parameters.md#citusshow_shards_for_app_name_prefixes-text).
 * There are several [deprecated
   features](https://www.citusdata.com/updates/v11-0/#deprecated-features).
```
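The Citus 12 rebalance-strategy change noted above can be inspected, and reverted if needed, from the coordinator. A sketch assuming a cluster already upgraded to Citus 12:

```sql
-- List the available rebalance strategies and see which is default.
SELECT name, default_strategy FROM pg_dist_rebalance_strategy;

-- To keep the pre-12 behavior after upgrading, set the default back:
SELECT citus_set_default_rebalance_strategy('by_shard_count');
```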

articles/cosmos-db/postgresql/howto-scale-grow.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -38,7 +38,7 @@ queries.
 > [!NOTE]
 > To take advantage of newly added nodes you must [rebalance distributed table
 > shards](howto-scale-rebalance.md), which means moving some
-> [shards](concepts-distributed-data.md#shards) from existing nodes
+> [shards](concepts-nodes.md#shards) from existing nodes
 > to the new ones. Rebalancing can work in the background, and requires no
 > downtime.
```

articles/cosmos-db/postgresql/howto-scale-rebalance.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -15,7 +15,7 @@ ms.date: 01/30/2023
 [!INCLUDE [PostgreSQL](../includes/appliesto-postgresql.md)]
 
 To take advantage of newly added nodes, rebalance distributed table
-[shards](concepts-distributed-data.md#shards). Rebalancing moves shards from existing nodes to the new ones. Azure Cosmos DB for PostgreSQL offers
+[shards](concepts-nodes.md#shards). Rebalancing moves shards from existing nodes to the new ones. Azure Cosmos DB for PostgreSQL offers
 zero-downtime rebalancing, meaning queries continue without interruption during
 shard rebalancing.
```
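The zero-downtime rebalancing described above can be triggered and monitored from the coordinator. A sketch assuming Citus 11.1 or later on a provisioned cluster:

```sql
-- Start a background shard rebalance; queries keep running while
-- shards move to the new nodes.
SELECT citus_rebalance_start();

-- Check on the background rebalance job.
SELECT * FROM citus_rebalance_status();
```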
