Contents
Praise for This Book
Preface
Chapter 1 Environment Setup
1.1 Preparing the Runtime Environment
1.1.1 Installing the JDK
1.1.2 Installing Scala
1.1.3 Installing Spark
1.2 First Experience with Spark
1.2.1 Running spark-shell
1.2.2 Running word count
1.2.3 Dissecting spark-shell
1.3 Preparing the Code-Reading Environment
1.3.1 Installing SBT
1.3.2 Installing Git
1.3.3 Installing the Eclipse Scala IDE Plugin
1.4 Compiling and Debugging the Spark Source Code
1.5 Summary
Chapter 2 Design Philosophy and Basic Architecture
2.1 Introducing Spark
2.1.1 Limitations of Hadoop MRv1
2.1.2 Features of Spark
2.1.3 Spark Use Cases
2.2 Spark Fundamentals
2.3 Spark's Basic Design Ideas
2.3.1 Spark Module Design
2.3.2 Spark Model Design
2.4 Spark's Basic Architecture
2.5 Summary
Chapter 3 Spark Infrastructure
3.1 Spark Configuration
3.1.1 Configuration from System Properties
3.1.2 Configuring with the SparkConf API
3.1.3 Cloning a SparkConf Configuration
3.2 Spark's Built-in RPC Framework
3.2.1 RPC Configuration: TransportConf
3.2.2 RPC Client Factory: TransportClientFactory
3.2.3 RPC Server: TransportServer
3.2.4 Pipeline Initialization
3.2.5 TransportChannelHandler in Detail
3.2.6 The Server-Side RpcHandler in Detail
3.2.7 Server Bootstrap: TransportServerBootstrap
3.2.8 The Client TransportClient in Detail
3.3 Event Bus
3.3.1 The ListenerBus Inheritance Hierarchy
3.3.2 SparkListenerBus in Detail
3.3.3 LiveListenerBus in Detail
3.4 Metrics System
3.4.1 The Source Inheritance Hierarchy
3.4.2 The Sink Inheritance Hierarchy
3.5 Summary
Chapter 4 SparkContext Initialization
4.1 SparkContext Overview
4.2 Creating the Spark Environment
4.3 Implementation of SparkUI
4.3.1 SparkUI Overview
4.3.2 The WebUI Framework
4.3.3 Creating SparkUI
4.4 Creating the Heartbeat Receiver
4.5 Creating and Starting the Scheduling System
4.6 Initializing the Block Manager: BlockManager
4.7 Starting the Metrics System
4.8 Creating the Event Log Listener
4.9 Creating and Starting ExecutorAllocationManager
4.10 Creating and Starting ContextCleaner
4.10.1 Creating ContextCleaner
4.10.2 Starting ContextCleaner
4.11 Additional SparkListeners and Starting the Event Bus
4.12 Updating the Spark Environment
4.13 Finishing SparkContext Initialization
4.14 Common Methods Provided by SparkContext
4.15 SparkContext's Companion Object
4.16 Summary
Chapter 5 The Spark Execution Environment
5.1 SparkEnv Overview
5.2 Security Manager: SecurityManager
5.3 RPC Environment
5.3.1 RPC Endpoint: RpcEndpoint
5.3.2 RPC Endpoint Reference: RpcEndpointRef
5.3.3 Creating the Transport Context: TransportConf
5.3.4 Message Dispatcher: Dispatcher
5.3.5 Creating the Transport Context: TransportContext
5.3.6 Creating the Transport Client Factory: TransportClientFactory
5.3.7 Creating TransportServer
5.3.8 Sending Client Requests
5.3.9 Common Methods in NettyRpcEnv
5.4 Serializer Manager: SerializerManager
5.5 Broadcast Manager: BroadcastManager
5.6 Map Task Output Tracker
5.6.1 Implementation of MapOutputTracker
5.6.2 How MapOutputTrackerMaster Is Implemented
5.7 Building the Storage System
5.8 Creating the Metrics System
5.8.1 MetricsConfig in Detail
5.8.2 Common Methods in MetricsSystem
5.8.3 Starting MetricsSystem
5.9 Output Commit Coordinator
5.9.1 Implementation of OutputCommitCoordinatorEndpoint
5.9.2 Implementation of OutputCommitCoordinator
5.9.3 How OutputCommitCoordinator Works
5.10 Creating SparkEnv
5.11 Summary
Chapter 6 The Storage System
6.1 Storage System Overview
6.1.1 Storage System Architecture
6.1.2 Basic Concepts
6.2 Block Info Manager
6.2.1 Basic Concepts of Block Locks
6.2.2 Implementation of Block Locks
6.3 Disk Block Manager
6.3.1 Local Directory Structure
6.3.2 Methods Provided by DiskBlockManager
6.4 Disk Storage: DiskStore
6.5 Memory Manager
6.5.1 The Memory Pool Model
6.5.2 StorageMemoryPool in Detail
6.5.3 The MemoryManager Model
6.5.4 UnifiedMemoryManager in Detail
6.6 Memory Storage: MemoryStore
6.6.1 MemoryStore's Memory Model
6.6.2 Methods Provided by MemoryStore
6.7 Block Manager: BlockManager
6.7.1 Initializing BlockManager
6.7.2 Methods Provided by BlockManager
6.8 How BlockManagerMaster Manages BlockManager
6.8.1 Responsibilities of BlockManagerMaster
6.8.2 BlockManagerMasterEndpoint in Detail
6.8.3 BlockManagerSlaveEndpoint in Detail
6.9 Block Transfer Service
6.9.1 Initializing NettyBlockTransferService
6.9.2 NettyBlockRpcServer in Detail
6.9.3 The Shuffle Client
6.10 DiskBlockObjectWriter in Detail
6.11 Summary
Chapter 7 The Scheduling System
7.1 Scheduling System Overview
7.2 RDD in Detail
7.2.1 Why RDDs Are Needed
7.2.2 A First Look at the RDD Implementation
7.2.3 RDD Dependencies
7.2.4 Partition Calculator: Partitioner
7.2.5 RDDInfo
7.3 Stage in Detail
7.3.1 Implementation of ResultStage
7.3.2 Implementation of ShuffleMapStage
7.3.3 StageInfo
7.4 The DAG-Oriented Scheduler: DAGScheduler
7.4.1 JobListener and JobWaiter
7.4.2 ActiveJob in Detail
7.4.3 A Brief Introduction to DAGSchedulerEventProcessLoop
7.4.4 Components of DAGScheduler
7.4.5 Common Methods Provided by DAGScheduler
7.4.6 DAGScheduler and Job Submission
7.4.7 Building Stages
7.4.8 Submitting the ResultStage
7.4.9 Submitting Tasks Not Yet Computed
7.4.10 The DAGScheduler Scheduling Flow
7.4.11 Handling Task Execution Results
7.5 Scheduling Pool: Pool
7.5.1 Scheduling Algorithms
7.5.2 Implementation of Pool
7.5.3 The Scheduling Pool Builder
7.6 Task Set Manager: TaskSetManager
7.6.1 Task Sets
7.6.2 Member Properties of TaskSetManager
7.6.3 Scheduling Pools and Speculative Execution
7.6.4 Task Locality
7.6.5 Common Methods of TaskSetManager
7.7 The Launcher Backend Interface: LauncherBackend
7.7.1 Implementation of BackendConnection
7.7.2 Implementation of LauncherBackend
7.8 The Scheduler Backend Interface: SchedulerBackend
7.8.1 Definition of SchedulerBackend
7.8.2 Analysis of the LocalSchedulerBackend Implementation
7.9 Task Result Getter: TaskResultGetter
7.9.1 Handling Successful Tasks
7.9.2 Handling Failed Tasks
7.10 Task Scheduler: TaskScheduler
7.10.1 Properties of TaskSchedulerImpl
7.10.2 Initializing TaskSchedulerImpl
7.10.3 Starting TaskSchedulerImpl
7.10.4 TaskSchedulerImpl and Task Submission
7.10.5 TaskSchedulerImpl and Resource Allocation
7.10.6 The TaskSchedulerImpl Scheduling Flow
7.10.7 How TaskSchedulerImpl Handles Execution Results
7.10.8 Common Methods of TaskSchedulerImpl
7.11 Summary
Chapter 8 The Compute Engine
8.1 Compute Engine Overview
8.2 The Memory Manager and Execution Memory
8.2.1 ExecutionMemoryPool in Detail
8.2.2 The MemoryManager Model and Execution Memory
8.2.3 UnifiedMemoryManager and Execution Memory
8.3 The Memory Manager and Tungsten
8.3.1 MemoryBlock in Detail
8.3.2 The MemoryManager Model and Tungsten
8.3.3 Tungsten's Memory Allocators
8.4 Task Memory Manager
8.4.1 TaskMemoryManager in Detail
8.4.2 Memory Consumers
8.4.3 Overall Architecture of Execution Memory
8.5 Task in Detail
8.5.1 Task Context: TaskContext
8.5.2 Definition of Task
8.5.3 Implementation of ShuffleMapTask
8.5.4 Implementation of ResultTask
8.6 IndexShuffleBlockResolver in Detail
8.7 Sampling and Estimation
8.7.1 Analysis of the SizeTracker Implementation
8.7.2 How SizeTracker Works
8.8 The Trait WritablePartitionedPairCollection
8.9 Analysis of the AppendOnlyMap Implementation
8.9.1 Capacity Growth in AppendOnlyMap
8.9.2 Data Updates in AppendOnlyMap
8.9.3 AppendOnlyMap's Cache Aggregation Algorithm
8.9.4 AppendOnlyMap's Built-in Sorting
8.9.5 Extensions of AppendOnlyMap
8.10 Analysis of the PartitionedPairBuffer Implementation
8.10.1 Capacity Growth in PartitionedPairBuffer
8.10.2 Insertion into PartitionedPairBuffer
8.10.3 PartitionedPairBuffer's Iterator
8.11 External Sorters
8.11.1 ExternalSorter in Detail
8.11.2 ShuffleExternalSorter in Detail
8.12 Shuffle Manager
8.12.1 ShuffleWriter in Detail
8.12.2 ShuffleBlockFetcherIterator in Detail
8.12.3 BlockStoreShuffleReader in Detail
8.12.4 SortShuffleManager in Detail
8.13 Shuffle Combinations on the Map Side and Reduce Side
8.14 Summary
Chapter 9 Deployment Modes
9.1 Heartbeat Receiver: HeartbeatReceiver
9.2 Analysis of the Executor Implementation
9.2.1 Executor Heartbeat Reporting
9.2.2 Running Tasks
9.3 The local Deployment Mode
9.4 Persistence Engine: PersistenceEngine
9.4.1 The File-System-Based Persistence Engine
9.4.2 The ZooKeeper-Based Persistence Engine
9.5 Leader Election Agent
9.6 Master in Detail
9.6.1 Starting the Master
9.6.2 Checking for Worker Timeouts
9.6.3 Handling Election as Leader
9.6.4 First-Level Resource Scheduling
9.6.5 Registering a Worker
9.6.6 Updating a Worker's Latest State
9.6.7 Handling Worker Heartbeats
9.6.8 Registering an Application
9.6.9 Handling Executor Requests
9.6.10 Handling Executor State Changes
9.6.11 Common Methods of Master
9.7 Worker in Detail
9.7.1 Starting the Worker
9.7.2 Registering the Worker with the Master
9.7.3 Sending Heartbeats to the Master
9.7.4 Worker and Leader Election
9.7.5 Running the Driver
9.7.6 Running Executors
9.7.7 Handling Executor State Changes
9.8 Implementation of StandaloneAppClient
9.8.1 Analysis of the ClientEndpoint Implementation
9.8.2 Analysis of the StandaloneAppClient Implementation
9.9 Analysis of the StandaloneSchedulerBackend Implementation
9.9.1 Properties of StandaloneSchedulerBackend
9.9.2 Analysis of the DriverEndpoint Implementation
9.9.3 Starting StandaloneSchedulerBackend
9.9.4 Stopping StandaloneSchedulerBackend
9.9.5 StandaloneSchedulerBackend and Resource Allocation
9.10 CoarseGrainedExecutorBackend in Detail
9.10.1 The CoarseGrainedExecutorBackend Process
9.10.2 Functional Analysis of CoarseGrainedExecutorBackend
9.11 The local-cluster Deployment Mode
9.11.1 Starting the Local Cluster
9.11.2 Startup Process of the local-cluster Deployment Mode
9.11.3 Executor Allocation in the local-cluster Deployment Mode
9.11.4 Task Submission and Execution in the local-cluster Deployment Mode
9.12 The Standalone Deployment Mode
9.12.1 Startup Process of the Standalone Deployment Mode
9.12.2 Executor Allocation in the Standalone Deployment Mode
9.12.3 Resource Reclamation in the Standalone Deployment Mode
9.12.4 Fault Tolerance in the Standalone Deployment Mode
9.13 Other Deployment Options
9.13.1 YARN
9.13.2 Mesos
9.14 Summary
Chapter 10 The Spark API
10.1 Basic Concepts
10.2 Data Source: DataSource
10.2.1 DataSourceRegister in Detail
10.2.2 DataSource in Detail
10.3 Implementation of Checkpoints
10.3.1 Implementation of CheckpointRDD
10.3.2 Implementation of RDDCheckpointData
10.3.3 Implementation of ReliableRDDCheckpointData
10.4 RDD Revisited
10.4.1 Transformation API
10.4.2 Action API
10.4.3 Analysis of the Checkpoint API Implementation
10.4.4 Iterative Computation
10.5 Data Collection: Dataset
10.6 DataFrameReader in Detail
10.7 SparkSession in Detail
10.7.1 SparkSession's Builder
10.7.2 The SparkSession API
10.8 A word count Example
10.8.1 Job Preparation Phase
10.8.2 Job Submission and Scheduling
10.9 Summary
Appendix