Uniffle Release 0.8.0
Highlight
- Support TEZ
- Introduce Netty for shuffle data transmission
- Use off heap memory to store shuffle data.
- Introduce REST API for cluster management.
- Introduce command line for cluster management.
ChangeLog
- [#1086][Doc] Simplify the Gluten code and add the doc (#1322)
- [#1290] improvement(operator): Avoid accidentally deleting data of other services when misconfiguring the mounting directory (#1291)
- [MINOR] fix: flaky test ShuffleTaskManagerTest#checkAndClearLeakShuffleDataTest (#1320)
- [MINOR] test: flaky test GrpcServerTest.testGrpcExecutorPool (#1321)
- [MINOR] docs: update jar name for spark client (#1289)
- [MINOR] chore: add scripts for publishing tarballs to svn (#1284)
- [MINOR] fix: incorrect version in spark client shaded modules
- [#1275] chore: add scripts for publishing maven releases (#1281)
- [#1274] feat: add shaded module for spark2 client (#1280)
- [#1273] feat: add shaded module for spark3 client (#1279)
- [#1277] chore: add flatten maven plugin (#1278)
- [#1252] fix(server): Incorrect storage write fail metric (#1253)
- [#1254][FOLLOWUP] fix(test): Fix the flaky test RssShuffleTest. (#1259)
- [#1261] fix(spark): Throw out InterruptedException for sleep in requestExecutorMemory #1262
- [#1254] fix(test): Fix the flaky test RssShuffleTest. (#1255)
- [#1243] fix(test): Fix the flaky test
SparkSQLTest
andRepartitionTest
(#1244) - [#1219] fix(test): Fix the flaky test
WriteAndReadMetricsTest
(#1235) - [MINOR] Fix kubernetest CI pipeline (#1227)
- [#1211] fix(server): unexpectedly removing resources when app has re-registered shuffle later (#1212)
- [#1204] chores(ci): Fix the ci pipeline of Kubernetes #1205
- [#1182] fix(operator): The LeaderElectionNamespace of the rss-controller is hard-coded to kube-system. (#1183)
- [#1175] fix(netty): Retry failed with StacklessClosedChannelException after channel closed (#1181)
- [#1177] improvement: Reduce the write time of tasks (#1179)
- [MINOR] docs: Fix spark.serializer in README and client_guide (#1180)
- [#1173] fix: incorrect shuffle server status (#1174)
- [#1164] refactor: Exposing the getDataLen method for ShuffleDataResult (#1170)
- [#1165][FOLLOWUP] Fix some spelling mistakes. (#1172)
- [#1165] improvement(tez): Unregister shuffle data after completing the execution of a DAG. (#1166)
- [#1013] fix(tez): Wait when return MapOutput of type wait (#1063)
- [#1062] fix(tez): Fix to use TezIdHelper to getTaskAttemptId instead of default IdHelper. (#1078)
- [#1161] improvement(netty): Reduce the data copy when accessing data buffer len (#1162)
- [#722] test: cleanup residue files in tmp directory after tests (#1134)
- [#1108] feat(server): Add labels with disk path for
total_localfile_write_data
metrics. (#1160) - [#1152] fix: Direct memory may leak in shuffle server. (#1154)
- [#1155] fix(netty): io.netty.util.internal.OutOfDirectMemoryError. (#1151)
- [#1051] fix(mr): Ensure configurations in both mapred-site.xml and dynamic_client.conf take effect. (#1112)
- [#1127] fix(netty): incorrect bytebuf release for ShuffleBlockInfo data in client side (#1150)
- [#1132] improvement(spark): Unregister shuffle explicitly when Spark application is stopped. #1139
- [#1044] chore: Add LICENSE-binary (#1054)
- [#1143] docs: Correct sequence number text by reducing paragraph indentation by 1 space (#1144)
- [#1124] docs(tez): Add the document of config option
tez.rss.client.send.thread.num
(#1142) - [#1124] docs(tez): Add the document for tez-client (#1140)
- [#837] feat: Display information of all application through Cli. (#964)
- [#951] fix rss-server Docker image building bug. (#1027)
- [#1115] improvement(tez&mr): Unregister shuffle explicitly when application is stopped. (#1131)
- [#1111] fix: Shuffle server can not delete remote storage path of secured HDFS cluster (#1122)
- [#1124] docs: modify the document for tez-client (#1126)
- [#1114] feat: introduce hdfs host as the total_hadoop_write_data metric label (#1107)
- [#1109] fix(tez): Fix the user of remote storage. (#1128)
- [#1124] docs: Add the document for tez-client (#1125)
- [#1081] fix(tez): shuffle can not read the data which is flushed to hdfs (#1118)
- [#299] improvement: Make config type of RSS_STORAGE_TYPE as enum (#1123)
- [#1045] fix(server): shuffle server may hang when restart worker due to multi require-momery and no require-momery release (#1058)
- [#1060] doc: Add the document for Netty. (#1116)
- [#434] refactor: New utility method to cover dynamic class loading in RSSUtils (#1104)
- [#1006] feat(spark): Support Spark 3.4 (#1082)
- [#1043] chore: Add NOTICE-binary (#1097)
- [#1101] improvement(tez): Release server resources as soon as possible in RssDAGAppMaster. (#1102)
- [MINOR]Fix(tez) add output mapOutputRecordCounter metrics (#1093)
- [#1074] feat: Introduce the metric of
local_storage_service_used_space
(#1075) - Revert "[#299] improvement: Make config type of RSS_STORAGE_TYPE as enum (#1052)" (#1106)
- [#299] improvement: Make config type of RSS_STORAGE_TYPE as enum (#1052)
- [#1100] improvement(mr): Fail fast in SortWriteBufferManager when failed to send shuffle data to shuffle server. (#1103)
- [#1095] doc: Add slack link to the readme.md (#1099)
- [#1066] fix: The Jetty server fails to start when compiled with JDK 8 but runtime version is JDK 11. (#1067)
- [#1070] fix(tez): shuffle server may leak if not register remote storage. (#1076)
- [#1068] feat(tez): Fail fast in client when failed to send data to server. (#1069)
- [MINOR] fix(mr): fix default value (#1035)
- [#1064] improvement(tez): Make shuffle data send thread pool configurable in WriteBufferManager. (#1065)
- [MINOR] doc: Add JDK benchmark tests (#1059)
- [#477] feat(spark): Fix
rss.resubmit.stage
does not support dynamic client conf. (#1050) - [#1011] feat(tez): Avoid recompute succeeded task. (#1033)
- [MINOR] chore(asf): stop forwarding GitHub issues to dev mail list (#1032)
- [#1048] fix(coord): ExcludeNodes does not take effect when the coordinator restarts. (#1049)
- [#640] feat(netty): Metric system for netty server (#1041)
- [#133] feat(netty): local shuffle read support zero-copy. (#1047)
- [MINOR] doc: fix metrics document (#1040)
- [#889] improvement: set default value of
rss.server.max.concurrency.of.per-partition.write
to30
. (#1037) - [#889] improvement: Modify default value of single buffer flush. (#1039)
- [#889] improvement: Modify the default value of the
rss.coordinator.select.partition.strategy
parameter toCONTINUOUS
. (#1036) - [#133] feat(netty): integration-test supports netty. (#1008)
- [#1018] test(tez) RssUnorderedPartitionedKVOutputTest add close func unit test (#1034)
- [MINOR] Add new collabrators (#1031)
- [#808] feat(spark): ensure thread safe and data consistency when spill (#848)
- [#1019] test(tez) RssOrderedPartitionedKVOutputTest add close func unit test (#1025)
- [#477] feat(spark): support getShuffleResult throws FetchFailedException. (#1004)
- [#973] improvement: Make shuffle manager client RPC timeout configurable (#1017)
- [#846] feat(cli): Service's start & stop & restart through Cli. (#925)
- [#388][FOLLOWUP] fix: Fix the flaky test GrpcServerTest (#1023)
- [#993] feat(tez): Optimize the method of obtain the application attemptId (#1021)
- [MINOR] refactor: use general method to check the remote storage existence (#980)
- [#972 ] fix(tez): Add output mapOutputByteCounter metrics (#1016)
- [MINOR] test(tez): add storage type configs in TezRemoteShuffleManagerTest (#1015)
- [#999] improvement: apply REST principles in the urls of the http interfaces (#1000)
- [#992] fix(tez): convertTaskAttemptIdToLong should not consider appattemptId (#1007)
- [#1010]chore: enable and enforce spotless (#1009)
- [#991] Improvement(tez): TezRemoteShuffleManager support secure cluster. (#1005)
- [#957] feat(tez): Tez examples integration test (#982)
- [#1001] improvement: support get all metrics by one request (#1002)
- [#757] feat(server): separate flush thread pools for different storage types. (#775)
- [#989] Bug(tez): parition class is not set for RssUnorderedKVOutput. (#994)
- [#864] feat(server): Introduce Jersey to strengthen REST API (#939)
- [#986][Improvement][tez] Optimize the method of obtain the vertex id. (#990)
- [MINOR] docs: update the document and scripts of build tez. (#997)
- [MINOR] Add new collabrators (#995)
- [#983] improvement(tez): Optimize tez client delivery configuration (#985)
- [MINOR] docs: update document for tez client plugin. (#987)
- [#768][Follow Up] feat(cli): Cli method for blacklist update. (#968)
- [#978][Improvement] Provides a tool class to format CLI output content (#979)
- [#965] feat(tez): support remote shuffle for tez framework (#966)
- [#961] refactor(coordinator): Improve Coordinator Log Format. (#962)
- [#940] improvement: Optimize columnar shuffle integration (#958)
- [#886] fix(tez): Tez Client may lost data or throw exception when rss.storage.type without MEMORY. (#976)
- [MINOR] fix(tez): fix thread factory name (#975)
- [#854][FOLLOWUP] feat(tez): Add Rss Input Class to begin Tez input task (#949)
- [#956] refactor: Changes the Boolean flag that determines whether a Node is healthy to a state (#959)
- [#854][FOLLOWUP] feat(tez): Add RssShuffleManager to run and manager shuffle work (#947)
- [#381]imporvement(server): Check JAVA_HOME and HADOOP_HOME in start-shuffle-server.sh (#954)
- [#768] feat(cli): Cli method for blacklist update (#931)
- [#937] feat: Add rest api for servernode list of losing connection and unhealthy (#938)
- [#940] feat: Support columnar shuffle with gluten (#950)
- [#854][FOLLOWUP] feat(tez): Add RssShuffle to handle event and generate fetch task (#929)
- [#854][FOLLOWUP] feat(tez): Add RssShuffleScheduler to run and manager shuffle work (#948)
- [#590][part-1] ManagedBuffer instead ByteBuf to hold ShuffleData (#906)
- [#854][FOLLOWUP] feat(tez): Add Simple Fetched Allocator to allocation memory or disk for shuffle data (#922)
- [#855] feat(tez): Support Tez Output RssUnorderedKVOutput (#944)
- [#855] feat(tez): Support Tez Output UnorderedPartitionedKVOutput (#943)
- [#854][FOLLOWUP] feat(tez): Add RssTezFetcherTask to fetch data from worker for OrderedInput (#935)
- [MINOR] Add new collaborators (#942)
- [#343] improvement(build): Shade Netty Native lib (#924)
- [#855] feat(tez): Support Tez Output OrderedPartitionedKVOutput (#930)
- [#933] fix incorrect metric grpc_server_connection_number (#934)
- [MINOR] fix: Fix kerberos ut error caused by Config.singleton is not refresh. (#932)
- [#854][FOLLOWUP] feat(tez): add RssTezFetcher to fetch data from worker. (#920)
- [#819 ] feat(tez): Tez ApplicationMaster supporting RemoteShuffle (#918)
- [#927] Improvement: improve the control of server heartbeat (#928)
- [MINOR] improvement(mr): Add @Test to activate test case (#923)
- [MINOR] fix: Error logs upload fail (#917)
- [#872][FOLLOWUP] feat(tez): Add UmbilicalUtils to get Worker info from AM (#919)
- [#872][FOLLOWUP] feat(tez): Modify utils and add test case (#916)
- [#872][FOLLOWUP] feat(tez): Add the common and utils class (#894)
- [#872] feat(tez): Get parameter from Inputcontext then provide util function (#915)
- [#908] feat(tez): Write byte array shuffle data to MapOutput (#909)
- [MINOR] refactor: Fix tez workflow and checkstyle (#911)
- [MINOR] docs: update document and build script for Hadoop-3.2 (#912)
- [#896] Improvement: Support Hadoop-3.2 (#897)
- [#881] fix: Ensure LocalStorageMeta disk size is correctly updated when events are processed (#902)
- [#417] refactor: Eliminate raw use of parameterized class (#891)
- [MINOR] docs: Add benchmark results (#904)
- [MINOR] refactor: Add overwrite annotation (#905)
- (jersey2) [#895] improvement: Rename Hdfs.java to Hadoop.java to support other Hadoop FS-compatible distributed filesystem (#898)
- [MINOR] fix: Fix LocalStorageManager divide by zero exception (#900)
- [#133] feat(netty): Fix IllegalReferenceCountException. (#899)
- [#886] fix(mr): MR Client may lost data or throw exception when rss.storage.type without MEMORY. (#887)
- [#872] feat(tez): Add the common and utils class (#890)
- [#715] fix(mr): The container does not exit because shuffleclient is not closed (#882)
- [#133] feat(netty): Implement ShuffleServer interface. (#879)
- [#884] improvement: Make start and stop scripts executable under the bin folder (#885)
- [MINOR] Remove unused code of shuffle upload (#883)
- [MINOR] chore: delete checkstyle-suppressions.xml (#878)
- [#493] improvement: replace
putIfAbsent
withcomputeIfAbsent
to avoid performance loss in some critical paths (#876) - [MINOR] refactor: Reduce the usage of memory in the
ShuffleWriter
(#877) - [#863] feat(coordinator): support comments in exclude node files (#874)
- [#774] docs: Fix the metadata.annotations: Too long in
install.md
(#853) - [MINOR] chore(ci): Add Tez pipeline (#862)
- [#593][FOLLOWUP] Fix zstd compress dest ByteBuffer position
- [#859][Improvement] Set MALLOC_ARENA_MAX in start-shuffle-server.sh (#860)
- [#596][FOLLOWUP] Index data support offheap read (#852)
- [#133] feat(netty): Introduce ShuffleServerGrpcNettyClient. (#839)
- [#770] feat(cli): Introduce apache.commons.cli basic framework (#833)
- [#414] feat(client): support specifying per-partition's max concurrency to write in client side (#815)
- [#841] feat(config): Support deprecated and fallback keys for ConfigOptions (#843)
- fix(client): disable spark memory spill (#844)
- build(operator): update clusterrole of controller and webhook (#842)
- [Improvement] Make codec to be a singleton. (#840)
- [#593][part-1] feat: Codec compress support ByteBuffer (#830)
- [#133] feat(netty): Rewrite protocol. (#826)
- [MINOR] Remove unused config
SHUFFLE_EXPIRED_TIMEOUT_MS
(#835) - [#827] feat(operator): support generating hpa objects (#828)
- [MINOR] Modify MD format (#829)
- [#778] feat: Separate
ShuffleServer
metrics through tags (#812) - [MINOR] Add
WORKDIR
(#823) - [#416] feat(hdfs): lazy initialization of hdfsShuffleWriteHandler when per-partition concurrent write is enabled (#816)
- [#596] feat(netty): Use off heap memory to read HDFS data (#806)
- [#779] feat: Grpc server support random port (#820)
- [MINOR] chore: update import order rule to make scala a separated group (#818)
- [#804] improvement: Optimize CRC calculation of ByteBuffer (#805)
- [#477][part-1] feat: support stage resubmit in spark clients (#787)
- [MINOR] test: Fix the flaky test
GrpcServerTest
(#789) - [BUG][MINOR] Fix scripts compatible with jdk8 & jdk11 (#798)
- [#796] bug:fix the issues of MetricReporter (#797)
- [#799] feat: use storage host label for remote storage write metrics (#800)
- [MINOR] fix: the description of Spark patches in the README.md (#801)
- [#794] feat(operator): support delete ShuffleServer pod with Evicted status (#795)
- [#720][FOLLOW-UP] Correct the shuffle server id (#792)
- [#782]refactor: restrict rss.rpc.server.type to an enum (#783)
- [#720] feat(netty): support random port for netty (#723)
- [#584] feat(netty): Add transport client pool for netty (#771)
- [MINOR] fix: set the default app number in
AccessAppQuotaChecker
(#786) - [#477][part-0] feat: add ShuffleManagerServer impl (#777)
- chore(spark): remove noisy log in client side (#776)
- [#772] fix(kerberos): cache proxy user ugi to avoid memory leak (#773)
- [#706] feat(spark): support spill to avoid memory deadlock (#714)
- [#755] refactor: Add the method of creating thread pool (#767)
- [#716] improvement(operator): support specifying imagePullSecrets (#765)
- [#519] Speed up ConcurrentHashMap#computeIfAbsent (#766)
- [MINOR] chore(operator): remove useless configuration items in configuration example (#764)
- [#133] feat(netty): Add Encoder and Decoder. (#742)
- [#762] If the storage path is not exist, get file store for its parent. (#763)
- [MINOR] test: do not wrap test in try-catch block (#746)
- [MINOR] chore: Remove committers from collaborators (#754)
- [#752] refactor: replace RuntimeException with RssException (#753)
- [MINOR] test: untracked files created in ShuffleFlushManagerTest (#745)
- [#750] feat: add RssFetchFailedException (#751)
- [#747] feat(tez): Add Tez Framework (#748)
- [#736] feat(storage): best effort to write same hdfs file when no race condition (#744)
- [#719] feat(netty): Optimize allocation strategy (#739)
- [#669] improvement: refresh application when reading memory data (#741)
- [#133] feat(netty): Add Netty Utils (#727)
- [MINOR] test: fix CoordinatorGrpcTest#shuffleServerHeartbeatTest on Linux SSD platform (#738)
- [#729] improvement: use
foreach
when iterate over Roaring64NavigableMap for better performance (#730) - [#733] test: fix LocalStorageManagerTest#testGetLocalStorageInfo on Linux SSD platform (#734)
- Revert "[MINOR] test: fix tempdir leak in KerberizedHdfs tests (#721)" (#732)
- [MINOR] log(client): update log level from INFO to DEBUG to avoid noisy (#725)
- [MINOR] chore(ci): test compile with java.version 11 and 17 (#728)
- [#625] improvement: Package sun.security.krb5 is not visible in Java 11 and 17. (#726)
- [#133] feat(netty): Add StreamServer. (#718)
- [MINOR] test: test class name should end with Test (#724)
- [MINOR] test: fix tempdir leak in KerberizedHdfs tests (#721)
- [#711] feat(netty): Add Netty port information for Shuffle Server (#712)
- [#564] test(operator): add end-to-end test (#581)
- [#615] improvement: Reduce task binary by removing 'partitionToServers' from RssShuffleHandle (#637)
- [MINOR] Removed unused methods and variable (#702)
- [#483] test: fix flaky test ShuffleServerFaultToleranceTest (#705)
- [#708] test: do not assume hostname of hdfs mini-cluster (#709)
- [MINOR] test: fix static field initialized before TempDir field injected (#707)
- [#80][Part-3] feat: add REST API for decommisson (#684)
- [MINOR] fix: Add method
close
for ApplicationManager (#704) - [MINOR] improvement: Reduce the size of Spark patch (#699)
- [#674] feat(docker): use JDK11 as the default java version in Dockerfile (#683)
- chore: remove unused log info (#700)
- build(deps): bump golang.org/x/net in /deploy/kubernetes/operator (#676)
- [#585] fix: avoid unbound variable errors in start-shuffle-server.sh
- [#691] fix(test): flaky test CoordinatorMetricsTest#testCoordinatorMetrics
- [#697] fix: use the naive equals method to avoid introducing additional dependencies (#698)
- [#585] feat(netty): Add MaxDirectMemorySize option for shuffle Server (#690)
- [#397] docs: add the usage of AccessQuotaChecker (#692)
- [MINOR] Use multithreading to detect multiple disks (#687)
- [#678] improvement: Write hdfs files asynchronously when
detectStorage
(#680) - [#645][Improvement] feat(operator): support manager parameter configuration (#670)
- [#671] feat(coordinator): Metrics of the number of apps submitted by users (#672)
- [#675] fix: filter no space exception in checkStorageReadAndWrite (#677)
- [#80][Part-2] feat: Add RPC logic and heartbeat logic for decommisson (#663)
- feat: introduce storage manager selector to support more selector strategy (#621)
- [MINOR] test: address unchecked conversions (#624)
- [#642]feat(server): better default options for shuffle server (#662)
- [#665] feat(client): keep consistent with vanilla spark when key or value is null (#666)
- [MINOR] docs: fix flaky-test-report.yml (#664)
- [#408] test: fix memory check failure in ShuffleBufferManagerTest#bufferSizeTest (#657)
- [#612] test: cleanup shuffleServer instance for each test (#658)
- [#659] fix(server): fix NPE of ShuffleDataFlushEvent's underStorage in some cases (#660)
- feat(coordinator): heap size configurable and add gc log (#656)
- [MINOR] docs: update the project introduction in README (#653)
- [#647][FOLLOWUP] set coordinator id before ApplicationMaster (#654)
- [#649] test: remove mini-cluster in ClientConfManagerTest (#650)
- [ISSUE-576] Increase the timeout and remove the initialization time (#646)
- [#635] feat(client): enable LOCAL_ORDER by default for Spark AQE (#644)
- [#647] fix: Multiple coordinator produce conflicts when they delect the same file (#648)
- [#545][FOLLOWUP]feat(operator): support specifying custom affinity & tolerations (#641)
- [MINOR] refactor: address unchecked conversions (#623)
- [#80][Part-1] feat: Add decommisson logic to shuffle server (#606)
- [#630] feat(client): Disable the localShuffleReader by default in Spark3 client (#636)
- [#632]fix: respect volumes in rss spec (#634)
- [#600] chore(operator): change JDK base from openjdk to eclipse-temurin (#617)
- [#626][FOLLOWUP] chore(ci): fix typo in build.yml (#633)
- [#627] fix(operator): support specifying custom ports (#629)
- [#626] chore(ci): skip build operator if no code changes (#628)
- [#580] chore(ci): disable parallel build in maven (#631)
- feat: support downloads latest hadoop archives and mirror url (#619)
- [FOLLOWUP] fix: don't recreate base dir if it's already existed (#616)
- [#551] docs: update templates for flaky test and pull request (#588)
- [MINOR] fix: allow
mountPoint
not containing '/' - [#613] test: SimpleClusterManagerTest#updateExcludeNodesTest (#614)