Uniffle Release 0.8.0

Highlight

  • Support Tez.
  • Introduce Netty for shuffle data transmission.
  • Use off-heap memory to store shuffle data.
  • Introduce a REST API for cluster management.
  • Introduce a command-line interface for cluster management.

ChangeLog

  • [#1086][Doc] Simplify the Gluten code and add the doc (#1322)
  • [#1290] improvement(operator): Avoid accidentally deleting data of other services when misconfiguring the mounting directory (#1291)
  • [MINOR] fix: flaky test ShuffleTaskManagerTest#checkAndClearLeakShuffleDataTest (#1320)
  • [MINOR] test: flaky test GrpcServerTest.testGrpcExecutorPool (#1321)
  • [MINOR] docs: update jar name for spark client (#1289)
  • [MINOR] chore: add scripts for publishing tarballs to svn (#1284)
  • [MINOR] fix: incorrect version in spark client shaded modules
  • [#1275] chore: add scripts for publishing maven releases (#1281)
  • [#1274] feat: add shaded module for spark2 client (#1280)
  • [#1273] feat: add shaded module for spark3 client (#1279)
  • [#1277] chore: add flatten maven plugin (#1278)
  • [#1252] fix(server): Incorrect storage write fail metric (#1253)
  • [#1254][FOLLOWUP] fix(test): Fix the flaky test RssShuffleTest. (#1259)
  • [#1261] fix(spark): Throw out InterruptedException for sleep in requestExecutorMemory #1262
  • [#1254] fix(test): Fix the flaky test RssShuffleTest. (#1255)
  • [#1243] fix(test): Fix the flaky test SparkSQLTest and RepartitionTest (#1244)
  • [#1219] fix(test): Fix the flaky test WriteAndReadMetricsTest (#1235)
  • [MINOR] Fix Kubernetes CI pipeline (#1227)
  • [#1211] fix(server): unexpectedly removing resources when app has re-registered shuffle later (#1212)
  • [#1204] chores(ci): Fix the ci pipeline of Kubernetes #1205
  • [#1182] fix(operator): The LeaderElectionNamespace of the rss-controller is hard-coded to kube-system. (#1183)
  • [#1175] fix(netty): Retry failed with StacklessClosedChannelException after channel closed (#1181)
  • [#1177] improvement: Reduce the write time of tasks (#1179)
  • [MINOR] docs: Fix spark.serializer in README and client_guide (#1180)
  • [#1173] fix: incorrect shuffle server status (#1174)
  • [#1164] refactor: Exposing the getDataLen method for ShuffleDataResult (#1170)
  • [#1165][FOLLOWUP] Fix some spelling mistakes. (#1172)
  • [#1165] improvement(tez): Unregister shuffle data after completing the execution of a DAG. (#1166)
  • [#1013] fix(tez): Wait when return MapOutput of type wait (#1063)
  • [#1062] fix(tez): Fix to use TezIdHelper to getTaskAttemptId instead of default IdHelper. (#1078)
  • [#1161] improvement(netty): Reduce the data copy when accessing data buffer len (#1162)
  • [#722] test: cleanup residue files in tmp directory after tests (#1134)
  • [#1108] feat(server): Add labels with disk path for total_localfile_write_data metrics. (#1160)
  • [#1152] fix: Direct memory may leak in shuffle server. (#1154)
  • [#1155] fix(netty): io.netty.util.internal.OutOfDirectMemoryError. (#1151)
  • [#1051] fix(mr): Ensure configurations in both mapred-site.xml and dynamic_client.conf take effect. (#1112)
  • [#1127] fix(netty): incorrect bytebuf release for ShuffleBlockInfo data in client side (#1150)
  • [#1132] improvement(spark): Unregister shuffle explicitly when Spark application is stopped. #1139
  • [#1044] chore: Add LICENSE-binary (#1054)
  • [#1143] docs: Correct sequence number text by reducing paragraph indentation by 1 space (#1144)
  • [#1124] docs(tez): Add the document of config option tez.rss.client.send.thread.num (#1142)
  • [#1124] docs(tez): Add the document for tez-client (#1140)
  • [#837] feat: Display information of all application through Cli. (#964)
  • [#951] fix rss-server Docker image building bug. (#1027)
  • [#1115] improvement(tez&mr): Unregister shuffle explicitly when application is stopped. (#1131)
  • [#1111] fix: Shuffle server can not delete remote storage path of secured HDFS cluster (#1122)
  • [#1124] docs: modify the document for tez-client (#1126)
  • [#1114] feat: introduce hdfs host as the total_hadoop_write_data metric label (#1107)
  • [#1109] fix(tez): Fix the user of remote storage. (#1128)
  • [#1124] docs: Add the document for tez-client (#1125)
  • [#1081] fix(tez): shuffle can not read the data which is flushed to hdfs (#1118)
  • [#299] improvement: Make config type of RSS_STORAGE_TYPE as enum (#1123)
  • [#1045] fix(server): shuffle server may hang when restart worker due to multi require-memory and no require-memory release (#1058)
  • [#1060] doc: Add the document for Netty. (#1116)
  • [#434] refactor: New utility method to cover dynamic class loading in RSSUtils (#1104)
  • [#1006] feat(spark): Support Spark 3.4 (#1082)
  • [#1043] chore: Add NOTICE-binary (#1097)
  • [#1101] improvement(tez): Release server resources as soon as possible in RssDAGAppMaster. (#1102)
  • [MINOR]Fix(tez) add output mapOutputRecordCounter metrics (#1093)
  • [#1074] feat: Introduce the metric of local_storage_service_used_space (#1075)
  • Revert "[#299] improvement: Make config type of RSS_STORAGE_TYPE as enum (#1052)" (#1106)
  • [#299] improvement: Make config type of RSS_STORAGE_TYPE as enum (#1052)
  • [#1100] improvement(mr): Fail fast in SortWriteBufferManager when failed to send shuffle data to shuffle server. (#1103)
  • [#1095] doc: Add slack link to the readme.md (#1099)
  • [#1066] fix: The Jetty server fails to start when compiled with JDK 8 but runtime version is JDK 11. (#1067)
  • [#1070] fix(tez): shuffle server may leak if not register remote storage. (#1076)
  • [#1068] feat(tez): Fail fast in client when failed to send data to server. (#1069)
  • [MINOR] fix(mr): fix default value (#1035)
  • [#1064] improvement(tez): Make shuffle data send thread pool configurable in WriteBufferManager. (#1065)
  • [MINOR] doc: Add JDK benchmark tests (#1059)
  • [#477] feat(spark): Fix rss.resubmit.stage does not support dynamic client conf. (#1050)
  • [#1011] feat(tez): Avoid recompute succeeded task. (#1033)
  • [MINOR] chore(asf): stop forwarding GitHub issues to dev mail list (#1032)
  • [#1048] fix(coord): ExcludeNodes does not take effect when the coordinator restarts. (#1049)
  • [#640] feat(netty): Metric system for netty server (#1041)
  • [#133] feat(netty): local shuffle read support zero-copy. (#1047)
  • [MINOR] doc: fix metrics document (#1040)
  • [#889] improvement: set default value of rss.server.max.concurrency.of.per-partition.write to 30. (#1037)
  • [#889] improvement: Modify default value of single buffer flush. (#1039)
  • [#889] improvement: Modify the default value of the rss.coordinator.select.partition.strategy parameter to CONTINUOUS. (#1036)
  • [#133] feat(netty): integration-test supports netty. (#1008)
  • [#1018] test(tez) RssUnorderedPartitionedKVOutputTest add close func unit test (#1034)
  • [MINOR] Add new collaborators (#1031)
  • [#808] feat(spark): ensure thread safe and data consistency when spill (#848)
  • [#1019] test(tez) RssOrderedPartitionedKVOutputTest add close func unit test (#1025)
  • [#477] feat(spark): support getShuffleResult throws FetchFailedException. (#1004)
  • [#973] improvement: Make shuffle manager client RPC timeout configurable (#1017)
  • [#846] feat(cli): Service's start & stop & restart through Cli. (#925)
  • [#388][FOLLOWUP] fix: Fix the flaky test GrpcServerTest (#1023)
  • [#993] feat(tez): Optimize the method of obtain the application attemptId (#1021)
  • [MINOR] refactor: use general method to check the remote storage existence (#980)
  • [#972 ] fix(tez): Add output mapOutputByteCounter metrics (#1016)
  • [MINOR] test(tez): add storage type configs in TezRemoteShuffleManagerTest (#1015)
  • [#999] improvement: apply REST principles in the urls of the http interfaces (#1000)
  • [#992] fix(tez): convertTaskAttemptIdToLong should not consider appattemptId (#1007)
  • [#1010]chore: enable and enforce spotless (#1009)
  • [#991] Improvement(tez): TezRemoteShuffleManager support secure cluster. (#1005)
  • [#957] feat(tez): Tez examples integration test (#982)
  • [#1001] improvement: support get all metrics by one request (#1002)
  • [#757] feat(server): separate flush thread pools for different storage types. (#775)
  • [#989] Bug(tez): partition class is not set for RssUnorderedKVOutput. (#994)
  • [#864] feat(server): Introduce Jersey to strengthen REST API (#939)
  • [#986][Improvement][tez] Optimize the method of obtain the vertex id. (#990)
  • [MINOR] docs: update the document and scripts of build tez. (#997)
  • [MINOR] Add new collaborators (#995)
  • [#983] improvement(tez): Optimize tez client delivery configuration (#985)
  • [MINOR] docs: update document for tez client plugin. (#987)
  • [#768][Follow Up] feat(cli): Cli method for blacklist update. (#968)
  • [#978][Improvement] Provides a tool class to format CLI output content (#979)
  • [#965] feat(tez): support remote shuffle for tez framework (#966)
  • [#961] refactor(coordinator): Improve Coordinator Log Format. (#962)
  • [#940] improvement: Optimize columnar shuffle integration (#958)
  • [#886] fix(tez): Tez Client may lose data or throw exception when rss.storage.type without MEMORY. (#976)
  • [MINOR] fix(tez): fix thread factory name (#975)
  • [#854][FOLLOWUP] feat(tez): Add Rss Input Class to begin Tez input task (#949)
  • [#956] refactor: Changes the Boolean flag that determines whether a Node is healthy to a state (#959)
  • [#854][FOLLOWUP] feat(tez): Add RssShuffleManager to run and manager shuffle work (#947)
  • [#381] improvement(server): Check JAVA_HOME and HADOOP_HOME in start-shuffle-server.sh (#954)
  • [#768] feat(cli): Cli method for blacklist update (#931)
  • [#937] feat: Add rest api for servernode list of losing connection and unhealthy (#938)
  • [#940] feat: Support columnar shuffle with gluten (#950)
  • [#854][FOLLOWUP] feat(tez): Add RssShuffle to handle event and generate fetch task (#929)
  • [#854][FOLLOWUP] feat(tez): Add RssShuffleScheduler to run and manager shuffle work (#948)
  • [#590][part-1] ManagedBuffer instead ByteBuf to hold ShuffleData (#906)
  • [#854][FOLLOWUP] feat(tez): Add Simple Fetched Allocator to allocation memory or disk for shuffle data (#922)
  • [#855] feat(tez): Support Tez Output RssUnorderedKVOutput (#944)
  • [#855] feat(tez): Support Tez Output UnorderedPartitionedKVOutput (#943)
  • [#854][FOLLOWUP] feat(tez): Add RssTezFetcherTask to fetch data from worker for OrderedInput (#935)
  • [MINOR] Add new collaborators (#942)
  • [#343] improvement(build): Shade Netty Native lib (#924)
  • [#855] feat(tez): Support Tez Output OrderedPartitionedKVOutput (#930)
  • [#933] fix incorrect metric grpc_server_connection_number (#934)
  • [MINOR] fix: Fix kerberos ut error caused by Config.singleton is not refresh. (#932)
  • [#854][FOLLOWUP] feat(tez): add RssTezFetcher to fetch data from worker. (#920)
  • [#819 ] feat(tez): Tez ApplicationMaster supporting RemoteShuffle (#918)
  • [#927] Improvement: improve the control of server heartbeat (#928)
  • [MINOR] improvement(mr): Add @Test to activate test case (#923)
  • [MINOR] fix: Error logs upload fail (#917)
  • [#872][FOLLOWUP] feat(tez): Add UmbilicalUtils to get Worker info from AM (#919)
  • [#872][FOLLOWUP] feat(tez): Modify utils and add test case (#916)
  • [#872][FOLLOWUP] feat(tez): Add the common and utils class (#894)
  • [#872] feat(tez): Get parameter from Inputcontext then provide util function (#915)
  • [#908] feat(tez): Write byte array shuffle data to MapOutput (#909)
  • [MINOR] refactor: Fix tez workflow and checkstyle (#911)
  • [MINOR] docs: update document and build script for Hadoop-3.2 (#912)
  • [#896] Improvement: Support Hadoop-3.2 (#897)
  • [#881] fix: Ensure LocalStorageMeta disk size is correctly updated when events are processed (#902)
  • [#417] refactor: Eliminate raw use of parameterized class (#891)
  • [MINOR] docs: Add benchmark results (#904)
  • [MINOR] refactor: Add overwrite annotation (#905)
  • (jersey2) [#895] improvement: Rename Hdfs.java to Hadoop.java to support other Hadoop FS-compatible distributed filesystem (#898)
  • [MINOR] fix: Fix LocalStorageManager divide by zero exception (#900)
  • [#133] feat(netty): Fix IllegalReferenceCountException. (#899)
  • [#886] fix(mr): MR Client may lose data or throw exception when rss.storage.type without MEMORY. (#887)
  • [#872] feat(tez): Add the common and utils class (#890)
  • [#715] fix(mr): The container does not exit because shuffleclient is not closed (#882)
  • [#133] feat(netty): Implement ShuffleServer interface. (#879)
  • [#884] improvement: Make start and stop scripts executable under the bin folder (#885)
  • [MINOR] Remove unused code of shuffle upload (#883)
  • [MINOR] chore: delete checkstyle-suppressions.xml (#878)
  • [#493] improvement: replace putIfAbsent with computeIfAbsent to avoid performance loss in some critical paths (#876)
  • [MINOR] refactor: Reduce the usage of memory in the ShuffleWriter (#877)
  • [#863] feat(coordinator): support comments in exclude node files (#874)
  • [#774] docs: Fix the metadata.annotations: Too long in install.md (#853)
  • [MINOR] chore(ci): Add Tez pipeline (#862)
  • [#593][FOLLOWUP] Fix zstd compress dest ByteBuffer position
  • [#859][Improvement] Set MALLOC_ARENA_MAX in start-shuffle-server.sh (#860)
  • [#596][FOLLOWUP] Index data support offheap read (#852)
  • [#133] feat(netty): Introduce ShuffleServerGrpcNettyClient. (#839)
  • [#770] feat(cli): Introduce apache.commons.cli basic framework (#833)
  • [#414] feat(client): support specifying per-partition's max concurrency to write in client side (#815)
  • [#841] feat(config): Support deprecated and fallback keys for ConfigOptions (#843)
  • fix(client): disable spark memory spill (#844)
  • build(operator): update clusterrole of controller and webhook (#842)
  • [Improvement] Make codec to be a singleton. (#840)
  • [#593][part-1] feat: Codec compress support ByteBuffer (#830)
  • [#133] feat(netty): Rewrite protocol. (#826)
  • [MINOR] Remove unused config SHUFFLE_EXPIRED_TIMEOUT_MS (#835)
  • [#827] feat(operator): support generating hpa objects (#828)
  • [MINOR] Modify MD format (#829)
  • [#778] feat: Separate ShuffleServer metrics through tags (#812)
  • [MINOR] Add WORKDIR (#823)
  • [#416] feat(hdfs): lazy initialization of hdfsShuffleWriteHandler when per-partition concurrent write is enabled (#816)
  • [#596] feat(netty): Use off heap memory to read HDFS data (#806)
  • [#779] feat: Grpc server support random port (#820)
  • [MINOR] chore: update import order rule to make scala a separated group (#818)
  • [#804] improvement: Optimize CRC calculation of ByteBuffer (#805)
  • [#477][part-1] feat: support stage resubmit in spark clients (#787)
  • [MINOR] test: Fix the flaky test GrpcServerTest (#789)
  • [BUG][MINOR] Fix scripts compatible with jdk8 & jdk11 (#798)
  • [#796] bug:fix the issues of MetricReporter (#797)
  • [#799] feat: use storage host label for remote storage write metrics (#800)
  • [MINOR] fix: the description of Spark patches in the README.md (#801)
  • [#794] feat(operator): support delete ShuffleServer pod with Evicted status (#795)
  • [#720][FOLLOW-UP] Correct the shuffle server id (#792)
  • [#782]refactor: restrict rss.rpc.server.type to an enum (#783)
  • [#720] feat(netty): support random port for netty (#723)
  • [#584] feat(netty): Add transport client pool for netty (#771)
  • [MINOR] fix: set the default app number in AccessAppQuotaChecker (#786)
  • [#477][part-0] feat: add ShuffleManagerServer impl (#777)
  • chore(spark): remove noisy log in client side (#776)
  • [#772] fix(kerberos): cache proxy user ugi to avoid memory leak (#773)
  • [#706] feat(spark): support spill to avoid memory deadlock (#714)
  • [#755] refactor: Add the method of creating thread pool (#767)
  • [#716] improvement(operator): support specifying imagePullSecrets (#765)
  • [#519] Speed up ConcurrentHashMap#computeIfAbsent (#766)
  • [MINOR] chore(operator): remove useless configuration items in configuration example (#764)
  • [#133] feat(netty): Add Encoder and Decoder. (#742)
  • [#762] If the storage path is not exist, get file store for its parent. (#763)
  • [MINOR] test: do not wrap test in try-catch block (#746)
  • [MINOR] chore: Remove committers from collaborators (#754)
  • [#752] refactor: replace RuntimeException with RssException (#753)
  • [MINOR] test: untracked files created in ShuffleFlushManagerTest (#745)
  • [#750] feat: add RssFetchFailedException (#751)
  • [#747] feat(tez): Add Tez Framework (#748)
  • [#736] feat(storage): best effort to write same hdfs file when no race condition (#744)
  • [#719] feat(netty): Optimize allocation strategy (#739)
  • [#669] improvement: refresh application when reading memory data (#741)
  • [#133] feat(netty): Add Netty Utils (#727)
  • [MINOR] test: fix CoordinatorGrpcTest#shuffleServerHeartbeatTest on Linux SSD platform (#738)
  • [#729] improvement: use foreach when iterate over Roaring64NavigableMap for better performance (#730)
  • [#733] test: fix LocalStorageManagerTest#testGetLocalStorageInfo on Linux SSD platform (#734)
  • Revert "[MINOR] test: fix tempdir leak in KerberizedHdfs tests (#721)" (#732)
  • [MINOR] log(client): update log level from INFO to DEBUG to avoid noisy (#725)
  • [MINOR] chore(ci): test compile with java.version 11 and 17 (#728)
  • [#625] improvement: Package sun.security.krb5 is not visible in Java 11 and 17. (#726)
  • [#133] feat(netty): Add StreamServer. (#718)
  • [MINOR] test: test class name should end with Test (#724)
  • [MINOR] test: fix tempdir leak in KerberizedHdfs tests (#721)
  • [#711] feat(netty): Add Netty port information for Shuffle Server (#712)
  • [#564] test(operator): add end-to-end test (#581)
  • [#615] improvement: Reduce task binary by removing 'partitionToServers' from RssShuffleHandle (#637)
  • [MINOR] Removed unused methods and variable (#702)
  • [#483] test: fix flaky test ShuffleServerFaultToleranceTest (#705)
  • [#708] test: do not assume hostname of hdfs mini-cluster (#709)
  • [MINOR] test: fix static field initialized before TempDir field injected (#707)
  • [#80][Part-3] feat: add REST API for decommission (#684)
  • [MINOR] fix: Add method close for ApplicationManager (#704)
  • [MINOR] improvement: Reduce the size of Spark patch (#699)
  • [#674] feat(docker): use JDK11 as the default java version in Dockerfile (#683)
  • chore: remove unused log info (#700)
  • build(deps): bump golang.org/x/net in /deploy/kubernetes/operator (#676)
  • [#585] fix: avoid unbound variable errors in start-shuffle-server.sh
  • [#691] fix(test): flaky test CoordinatorMetricsTest#testCoordinatorMetrics
  • [#697] fix: use the naive equals method to avoid introducing additional dependencies (#698)
  • [#585] feat(netty): Add MaxDirectMemorySize option for shuffle Server (#690)
  • [#397] docs: add the usage of AccessQuotaChecker (#692)
  • [MINOR] Use multithreading to detect multiple disks (#687)
  • [#678] improvement: Write hdfs files asynchronously when detectStorage (#680)
  • [#645][Improvement] feat(operator): support manager parameter configuration (#670)
  • [#671] feat(coordinator): Metrics of the number of apps submitted by users (#672)
  • [#675] fix: filter no space exception in checkStorageReadAndWrite (#677)
  • [#80][Part-2] feat: Add RPC logic and heartbeat logic for decommission (#663)
  • feat: introduce storage manager selector to support more selector strategy (#621)
  • [MINOR] test: address unchecked conversions (#624)
  • [#642]feat(server): better default options for shuffle server (#662)
  • [#665] feat(client): keep consistent with vanilla spark when key or value is null (#666)
  • [MINOR] docs: fix flaky-test-report.yml (#664)
  • [#408] test: fix memory check failure in ShuffleBufferManagerTest#bufferSizeTest (#657)
  • [#612] test: cleanup shuffleServer instance for each test (#658)
  • [#659] fix(server): fix NPE of ShuffleDataFlushEvent's underStorage in some cases (#660)
  • feat(coordinator): heap size configurable and add gc log (#656)
  • [MINOR] docs: update the project introduction in README (#653)
  • [#647][FOLLOWUP] set coordinator id before ApplicationMaster (#654)
  • [#649] test: remove mini-cluster in ClientConfManagerTest (#650)
  • [ISSUE-576] Increase the timeout and remove the initialization time (#646)
  • [#635] feat(client): enable LOCAL_ORDER by default for Spark AQE (#644)
  • [#647] fix: Multiple coordinator produce conflicts when they delect the same file (#648)
  • [#545][FOLLOWUP]feat(operator): support specifying custom affinity & tolerations (#641)
  • [MINOR] refactor: address unchecked conversions (#623)
  • [#80][Part-1] feat: Add decommission logic to shuffle server (#606)
  • [#630] feat(client): Disable the localShuffleReader by default in Spark3 client (#636)
  • [#632]fix: respect volumes in rss spec (#634)
  • [#600] chore(operator): change JDK base from openjdk to eclipse-temurin (#617)
  • [#626][FOLLOWUP] chore(ci): fix typo in build.yml (#633)
  • [#627] fix(operator): support specifying custom ports (#629)
  • [#626] chore(ci): skip build operator if no code changes (#628)
  • [#580] chore(ci): disable parallel build in maven (#631)
  • feat: support downloads latest hadoop archives and mirror url (#619)
  • [FOLLOWUP] fix: don't recreate base dir if it's already existed (#616)
  • [#551] docs: update templates for flaky test and pull request (#588)
  • [MINOR] fix: allow mountPoint not containing '/'
  • [#613] test: SimpleClusterManagerTest#updateExcludeNodesTest (#614)