跳到主要内容

Uniffle Release 0.7.0

Highlight

  • Better support for Spark AQE
    • Leveraging the LOCAL_ORDER data distribution to improve performance
    • Estimating assignment number of shuffle servers to improve user experience
    • Better assignment policy of assigning adjacent partitions to the same shuffle server to improve performance
  • Optimization of huge partition to improve stability of shuffle servers
  • More bug fixes and usability improvements of K8S operator
  • Add support of user quota management and more compression algorithms
  • Add support of spark data eviction mechanism of stage level
  • Some improvement of stability and performance

ChangeLog

  • Add more badges in README by @kaijchen in #219
  • Fix incorrect log format strings by @kaijchen in #220
  • Change total lines badge url to sloc.xyz in README by @kaijchen in #222
  • [MINOR] Fix warnings reported by lgtm by @kaijchen in #223
  • [MINOR] Simplify creating buffer logic by @zuston in #227
  • Support cancelling previous ci actions by @zuston in #225
  • Use the conf of shuffleNodesNumber from jobs to be as checking factor by @zuston in #208
  • Output the stderr and stdout to output file in startup script by @zuston in #226
  • [ISSUE-48][FEATURE][FOLLOW UP] Add controller component by @wangao1236 in #214
  • Add more metrics about requiring read memory by @zuston in #231
  • Adjust the memory required times to match grpc max deadline conf by @zuston in #218
  • [MINOR] Fix flaky test by @jerqi in #238
  • [ISSUE-48][FEATURE][FOLLOW UP] Add yaml of components and crd exampes by @wangao1236 in #236
  • Fix Flaky test GetShuffleReportForMultiPartTest by @leixm in #241
  • Set the default disk capacity to the total space by @zuston in #237
  • Add issue template by @jerqi in #8
  • [MINOR] Fix inefficient map iteration by @kaijchen in #245
  • Support deploy multiple shuffle servers in a single node by @xianjingfeng in #166
  • Fast fail when reading failed in ComposedClientReadHandler by @zuston in #213
  • Fix startup shell problem by @jerqi in #251
  • New version 0.7.0-snapshot by @jerqi in #252
  • [ISSUE-196] Fix flaky test about kerberos by @zuston in #250
  • [ISSUE-48][FEATURE][FOLLOW UP] add unit test for validating rss objects by @wangao1236 in #248
  • [Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy by @smallzhongfeng in #210
  • [TYPO] Replace Chinese colon by ASCII colon by @kaijchen in #255
  • Introduce startup-silent-period mechanism to avoid partial assignments by @zuston in #247
  • Replace DISCLAIMER with DISCLAIMER-WIP by @jerqi in #258
  • [ISSUE-244] Fix flaky test of CoordinatorGrpcTest.rpcMetricsTest by @zuston in #256
  • Fix flaky test of ClientConfManagerTest by @smallzhongfeng in #260
  • [Refactor] Optimize creating shuffle handlers by @zuston in #259
  • Introduce data cleanup mechanism on stage level by @zuston in #249
  • [ISSUE-48][FEATURE][FOLLOW UP] add docs for operator by @wangao1236 in #261
  • Fix potenial missing reads of exclude nodes by @zuston in #269
  • [ISSUE-257] RssMRUtils#getBlockId change the partitionId of int type to long by @fpkgithub in #266
  • [ISSUE-273][BUG] Get shuffle result failed caused by concurrent calls to registerShuffle by @leixm in #274
  • Add enum type test about case insensitive by @zuston in #280
  • Support ZSTD by @zuston in #254
  • [ISSUE-239][BUG] RssUtils#transIndexDataToSegments should consider the length of the data file by @leixm in #275
  • Remove code quality badge and add release badge by @kaijchen in #284
  • [ISSUE-163][FEATURE] Write to hdfs when local disk can't be write by @xianjingfeng in #235
  • Upgrade Github actions for Node.js 16 by @kaijchen in #292
  • Fix NPE in WriteBufferManager.addRecord by @wForget in #296
  • Fix AbstractStorage#containsWriteHandler by @xianjingfeng in #281
  • Add more test cases on LocalStorageManager.selectStorage by @zuston in #298
  • [ISSUE-137][Improvement][AQE] Sort MapId before the data are flushed by @zuston in #293
  • [ISSUE-283][FEATURE] Support snappy compression/decompression by @amaliujia in #304
  • [ISSUE-290] Make RpcNodePort and HttpNodePort optional by @amaliujia in #305
  • [ISSUE-301][Subtask][Improvement][AQE] Merge continuous ShuffleDataSegment into single one by @zuston in #303
  • Cleanup RuntimeException and fetchRemoteStorage logic in ClientUtils by @kaijchen in #295
  • [ISSUE-135][FOLLOWUP][Improvement][AQE] Assign adjacent partitions to the same ShuffleServer by @leixm in #307
  • Correct the contributing guide link in pull-request template by @zuston in #314
  • Fix bug of "Comparison method violates its general contract" by @zuston in #315
  • [AQE][LocalOrder] Fix potenial bug when merging continuous segments by @zuston in #318
  • [AQE][LocalOrder] Fix wrong param of expectedTaskIds in LocalOrderSegmentSplit by @zuston in #319
  • [Feature] Support the estimated number of ShuffleServers required. by @leixm in #322
  • [Bug] Fix potenial bug when the index reading offset is greater than data length by @zuston in #320
  • [ISSUE-154][Improvement] Support Empty assignment to Shuffle Server by @rhh777 in #325
  • [Bug] Fix invalid owner of host path volumes by @wangao1236 in #330
  • [ISSUE-309][FEATURE] Support ShuffleServer latency metrics. by @leixm in #327
  • [ISSUE-329]Catch NPE in ShuffleTaskManager#addFinishedBlockIds by @xianjingfeng in #331
  • [BUG] Fix wrong method name by @leixm in #335
  • [ISSUE-328] Cleanup unused shuffle servers after stage completed by @xianjingfeng in #334
  • [MINOR] Migrate RankValue to the package of the common class by @smallzhongfeng in #265
  • [BUG] Fix incorrect spark metrics by @zuston in #324
  • [Improvement][LocalOrder] Add tests about keeping consistent with FixedSize when no skew optimization by @zuston in #336
  • [INFRA] Add k8s pipeline by @jerqi in #340
  • Remove unused class of RssShuffleUtils by @zuston in #345
  • [ISSUE-342][Improvement] Check Spark Serializer type by @chong0929 in #344
  • [Feature] Support user's app quota level limit by @smallzhongfeng in #311
  • [BUG][AQE][LocalOrder] Fix the bug of missed data due to block sorting by @zuston in #347
  • [ISSUE-364] Fix indexWriter don't close if exception thrown when close dataWriter by @xianjingfeng in #349
  • [BUG] Fix flaky test of AQESkewedJoinWithLocalOrderTest by @zuston in #350
  • Add collaborators by @jerqi in #351
  • [BUG][FOLLOWUP] Fix flaky test of AQESkewedJoinWithLocalOrderTest by @zuston in #355
  • [BUG][AQE][LocalOrder] Remove check of discontinuous map task ids by @zuston in #354
  • [Improvement] Task fast fail once blocks fail to send by @zuston in #332
  • [ISSUE-228][FEATURE] Add a period local storage cleaner thread by @sfwang218 in #357
  • [Improvement] Optimize the use of QuotaManager by @smallzhongfeng in #359
  • [ISSUE-339] Optimize retry logic in send shuffle data by @xianjingfeng in #361
  • [ISSUE-300] Make config type of RSS_CLIENT_TYPE as enum by @selectbook in #310
  • [ISSUE-228] Fix the problem of protobuf-java incorrect dependency at compile time by @tsface in #362
  • [ISSUE-124] Add fallback mechanism for blocks read inconsistent by @xianjingfeng in #276
  • [BUG] Fix potential memory leak when encountering disk unhealthy by @zuston in #370
  • [Improvement][AQE] Support getting memory data skip by upstream task ids by @zuston in #358
  • [ISSUE-369] Don't throw exception if blocks are corrupted but have multi replicas by @xianjingfeng in #374
  • [ISSUE-285][Improvement] Only use HDFS and LOCALFILE storageType in the test by @tiantingting5435 in #360
  • Fix typo of PreferDiffHostAssignmentStrategy by @zuston in #379
  • [ISSUE-376] Fix concurrency problems may occur when the ApplicationManager register app by @smallzhongfeng in #382
  • [ISSUE-380] Refactor the flush process to fix fallback fail by @zuston in #383
  • [Refactor] Make coordinator class more organized by @smallzhongfeng in #386
  • [ISSUE-392] Fix the bug in the shuffle data cleanup checker that causes false reports of disk corruption by @zuston in #393
  • [ISSUE-390] Print more infos after read finished by @xianjingfeng in #395
  • [Improvement] Skip blocks when read from memory by @xianjingfeng in #294
  • [Improvement] Small refactor for code quality by @advancedxy in #394
  • [BUG] Fix incorrect block info statistics after read finished by @xianjingfeng in #401
  • Revert "[Improvement] Skip blocks when read from memory (#294)" by @xianjingfeng in #403
  • [ISSUE-388][ISSUE-244][Bug] Fix incorrect usage of GRPCMetrics#setGauge by @xianjingfeng in #404
  • [MINOR] If there is no data flush to hdfs, return directly instead of throw exception by @xianjingfeng in #406
  • Support writing multi files of single partition to improve speed in HDFS storage by @zuston in #396
  • Fix incorrect metrics of event_queue_size and total_write_handler by @zuston in #411
  • [Improvement] Support skip memory data when use multiple replicas by @xianjingfeng in #400
  • [ISSUE-402] Flaky Test: QuorumTest#case1 by @Rembrant777 in #422
  • Fix flaky test QuotaManagerTest#testDetectUserResource by @xianjingfeng in #421
  • Improve README by @kaijchen in #427
  • Remove unused data structure and method by @zuston in #429
  • [FOLLOWUP] Remove unused methods in Storage interface by @zuston in #431
  • Flaky Test: AppBalanceSelectStorageStrategyTest#selectStorageTest by @smallzhongfeng in #438
  • [Minor] Move GrpcServerTest to common.rpc package by @kaijchen in #439
  • [Minor] refactor test code by @advancedxy in #432
  • [Minor][Improvement] Introduce FileWriter interface for Localfile/HDFS file writer by @zuston in #444
  • [ISSUE-169] Support metric reporter and Support promethues push gateway by @xianjingfeng in #415
  • [Improvement] Avoid selecting storage which has reached the high watermark by @zuston in #424
  • [Bug] Fix potential negative preAllocatedSize variable by @advancedxy in #428
  • [Feature] add a configuration to control shuffle data flush by @advancedxy in #445
  • [Improvement] Refactor getPartitionRange to calculate range directly by @a-li in #447
  • Fix Flaky Test: AppBalanceSelectStorageStrategyTest#selectStorageTest by @smallzhongfeng in #450
  • Fix Flaky Test: QuotaManagerTest#testDetectUserResource by @smallzhongfeng in #453
  • [ISSUE-451][Improvement] Read HDFS data files with random sequence to distribute pressure by @zuston in #452
  • [ISSUE-455] Lazily create uncompressedData by @xianjingfeng in #457
  • [ISSUE-378][HugePartition][Part-1] Record every partition data size for one app by @zuston in #458
  • [ISSUE-456] Avoid removeResources for multiple times by @xianjingfeng in #459
  • [ISSUE-461] Support Spark 3.3 by @kaijchen in #463
  • [Deps] Bump slf4j to 1.7.36 to fix vulnerability in slf4j-log4j12 by @kaijchen in #464
  • [ISSUE-448][Feature] shuffle server report storage info by @advancedxy in #449
  • [ISSUE-472] Fix Flaky Test: LocalFileServerReadHandlerTest#testDataInconsistent by @zuston in #473
  • [Improvement] Remove some unused empty server metrics by @zuston in #474
  • [Improvement] Add more logs about data flush by @zuston in #482
  • Fix potential race condition when registering remote storage info by @zuston in #481
  • [Minor] Improve readability by replacing lambda with method reference by @iwangjie in #488
  • [ISSUE-489][Minor] Cleanup some code by @iwangjie in #490
  • [Test] Assume unknown blockID in LocalFileHandlerTestBase by @kaijchen in #478
  • [ISSUE-475][Improvement] It's unnecessary to use ConcurrentHashMap for "partitionToBlockIds" in RssShuffleWriter by @zjf2012 in #480
  • [ISSUE-378][HugePartition][Part-2] Introduce memory usage limit and data flush by @zuston in #471
  • [ISSUE-484] Fix accidentally remove the storage of appId when unregistering partial shuffle in HdfsStorageManager by @zuston in #485
  • [Test] Cleanup tests with Files#createTempDir() by @kaijchen in #492
  • [Test] Add ConfigUtilsTest by @kaijchen in #500
  • [Deps] Bump protobuf to 3.19.6 to address vulnerability by @kaijchen in #499
  • [Minor] Make Constants final by @kaijchen in #501
  • Cleanup UnitConverter and improve UnitConverterTest by @kaijchen in #504
  • Fixes errors in doc header and operator install command by @zuston in #506
  • [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition by @zuston in #494
  • [ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions by @zuston in #505
  • [Deps] Switch to slf4j-reload4j by @kaijchen in #508
  • [ISSUE-512][Operator] Bump golang to 1.17 by @kaijchen in #515
  • [ISSUE-514] Fix flaky test: ShuffleServerGrpcTest#clearResourceTest by @xianjingfeng in #516
  • [ISSUE-507] Fix Flaky Test: ShuffleBufferManagerTest#cacheShuffleDataTest by @xianjingfeng in #511
  • [ISSUE-509] Fix Flaky Test: ShuffleBufferManagerTest#shuffleFlushThreshold by @xianjingfeng in #510
  • [ISSUE-479][operator] refine operator's build system by @advancedxy in #491
  • [ISSUE_479][operator][followup] use exec form instead of shell form in Dockerfile by @advancedxy in #518
  • [SpotBugs] Set threshold to middle with exceptions by @kaijchen in #517
  • [ISSUE-522][operator] pass RSS_IP to coordinator container env by @advancedxy in #523
  • [ISSUE-496][operator] infer resource request/limit from spec for init container by @advancedxy in #521
  • [SpotBugs] Remove unread protected field in Checker by @kaijchen in #520
  • [ISSUE-468] Put unavailable servers to the end of the list when sending shuffle data by @xianjingfeng in #470
  • chore: add new collaborator by @advancedxy in #535
  • [SpotBugs] Fix REC_CATCH_EXCEPTION by @kaijchen in #527
  • [ISSUE-524][operator] Upgrading rss could also be deleted by @advancedxy in #531
  • [ISSUE-553] Avoid removing buffer multiple times when clearing resources by @xianjingfeng in #534
  • [ISSUE-469][operator] feat: supports adding labels to rss pods. by @wangao1236 in #528
  • [SpotBugs] Fix UWF_UNWRITTEN_PUBLIC_OR_PROTECTED_FIELD by @kaijchen in #536
  • Result of mkdirs() is ignored in LocalFileWriteHandler#createBasePath() by @kaijchen in #537
  • [Minor] Optimize ShuffleServerInfo#hashCode by @xianjingfeng in #538
  • [SpotBugs] Enable SC_START_IN_CTOR check by @kaijchen in #541
  • [operator] fix error kind of ownerreference by @wangao1236 in #540
  • [ISSUE-542] Ensure the elements of StatusCode and RssProtos.StatusCode are the same by @xianjingfeng in #543
  • Flaky Test: LowestIOSampleCostSelectStorageStrategyTest#selectStorageTest by @smallzhongfeng in #544
  • [Improvement] Only report to the shuffle servers that owns the blocks by @zuston in #539
  • [Minor] Cleanup "throws RuntimeException" by @kaijchen in #549
  • [ISSUE-546] Replace ResponseStatusCode with StatusCode by @xianjingfeng in #547
  • [ISSUE-476][FEATURE] Respect spark.shuffle.compress configuration in Uniffle by @zjf2012 in #495
  • [ISSUE-545][operator] feat: support setting runtime class name and env for rss by @wangao1236 in #548
  • [ISSUE #525][operator] refine svc creations by @advancedxy in #530
  • [#552] docs: add more doc about spark.serializer requirement by @advancedxy in #556
  • [#559] test: use withEnvironmentVariable to replace RssUtilsTest#setEnv by @advancedxy in #560
  • [Followup #249] refactor: cleanup code and unify interfaces by @kaijchen in #558
  • [#554] feat: infer rss base storage conf from env by @advancedxy in #555
  • [MINOR] Remove commiters from collaborators by @jerqi in #563
  • [#525][FOLLOWUP] fix: add omitempty tag by @advancedxy in #565
  • [#545][FOLLOWUP] update rbac rule for webhook by @advancedxy in #566
  • [#571] feat: skip init for empty writable base dir by @advancedxy in #573
  • [#567] feat: add a shuffle-server metric about read_used_buffer_size by @zuston in #568
  • [MINOR] test: fix assertion in tests by @kaijchen in #574
  • [#575] refactor: replace switch-case with EnumMap in ComposedClientReadHandler by @kaijchen in #570
  • [#580] chore: improve CI workflows by @kaijchen in #579
  • [#410] feat: support the hot reload of coordinator's configuration by @jerqi in #572
  • [MINOR] chore(deps): bump go-restful to 2.16.0 in operator by @dependabot in #577
  • [#580] chore: move deploy/kubernetes to a standalone workflow by @kaijchen in #578
  • [MINOR] refactor: simplify ShuffleWriteClientImpl#genServerToBlocks() by @kaijchen in #594
  • [MINOR] chore: remove duplicated dependency in rss-client-mr by @kaijchen in #599
  • [#580] chore: speed up CI workflows by @kaijchen in #602
  • [MINOR] chore(operator): bump prometheus/client_golang to 1.11.1 by @dependabot in #601
  • [MINOR] docs: correct the format of server_guide doc by @zuston in #608
  • [FOLLOWUP] fix: don't recreate base dir if it's already existed by @advancedxy in #622