Support sharded Parquet file querying and conversion #7610

Open
SungJin1212 wants to merge 3 commits into cortexproject:master from SungJin1212:parquet-shard

@SungJin1212 SungJin1212 commented Jun 9, 2026

Member

This PR adds support for querying sharded Parquet files within a bucket store and enables the conversion of blocks into sharded Parquet files.

Benchmark Results

Currently, the concurrency is hard-coded to 4.

GOROOT=/usr/local/opt/go/libexec #gosetup
GOPATH=/Users/kakao_ent/go #gosetup
/usr/local/opt/go/libexec/bin/go test -c -tags=slicelabels -o /Users/kakao_ent/Library/Caches/JetBrains/GoLand2026.1/tmp/GoLand/___1BenchmarkParquetBucketStore_MultiShard_in_github_com_cortexproject_cortex_pkg_storegateway.test github.com/cortexproject/cortex/pkg/storegateway #gosetup
/Users/kakao_ent/Library/Caches/JetBrains/GoLand2026.1/tmp/GoLand/___1BenchmarkParquetBucketStore_MultiShard_in_github_com_cortexproject_cortex_pkg_storegateway.test -test.v -test.paniconexit0 -test.bench ^\QBenchmarkParquetBucketStore_MultiShard\E$ -test.run ^$ #gosetup
goos: darwin
goarch: amd64
pkg: github.com/cortexproject/cortex/pkg/storegateway
cpu: VirtualApple @ 2.50GHz
BenchmarkParquetBucketStore_MultiShard
BenchmarkParquetBucketStore_MultiShard/shards=1
BenchmarkParquetBucketStore_MultiShard/shards=1-14         	      72	  15630539 ns/op	36701543 B/op	  282624 allocs/op
BenchmarkParquetBucketStore_MultiShard/shards=2
BenchmarkParquetBucketStore_MultiShard/shards=2-14         	     100	  11494683 ns/op	38358405 B/op	  284007 allocs/op
BenchmarkParquetBucketStore_MultiShard/shards=4
BenchmarkParquetBucketStore_MultiShard/shards=4-14         	     100	  10774228 ns/op	38830028 B/op	  286728 allocs/op
BenchmarkParquetBucketStore_MultiShard/shards=8
BenchmarkParquetBucketStore_MultiShard/shards=8-14         	     100	  11819611 ns/op	38578193 B/op	  291999 allocs/op
PASS

Process finished with the exit code 0

Which issue(s) this PR fixes:
Fixes #7176 #7174

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags

@dosubot dosubot Bot added component/store-gateway go Pull requests that update Go code storage/blocks Blocks storage engine type/feature labels Jun 9, 2026
@SungJin1212 SungJin1212 force-pushed the parquet-shard branch 2 times, most recently from 635d72e to 524e917 Compare June 9, 2026 11:03
Signed-off-by: SungJin1212 <tjdwls1201@gmail.com>
Signed-off-by: SungJin1212 <tjdwls1201@gmail.com>
errGroup.SetLimit(p.concurrency)
for i := range shardBlockIDs {
errGroup.Go(func() error {
blk, err := p.newParquetBlock(egCtx, shardBlockIDs[i], shardIDs[i], bucketOpener, bucketOpener, p.chunksDecoder, p.rowRangesCache, noopQuota, noopQuota, noopQuota)

Contributor

I haven't looked at how this Parquet sharding works for quite some time now... The sharding is done on columns... So do we really need to open all shards here? Or, based on the sharding, can we tell when opening just one file is enough?

Member Author

We can't tell which shard holds a matching series, so we need to open all shard files. (The converter mark only stores the shard count, with no per-shard label metadata.)

# splits a block into more parquet shards for better read parallelization.
# Default is unlimited (single shard).
# CLI flag: -parquet-converter.num-row-groups
[num_row_groups: <int> | default = 2147483647]

Contributor

We should have an integration test with sharding enabled?

Member Author

I added an e2e test.

Signed-off-by: SungJin1212 <tjdwls1201@gmail.com>
Labels

component/store-gateway go Pull requests that update Go code size/XL storage/blocks Blocks storage engine type/feature

Development

Successfully merging this pull request may close these issues.

[Parquet] Support sharded parquet file conversion

2 participants