Metadata Export - Lakehouse Analytics Service (LAS) - Volcano Engine

Metadata Export

Last updated: 2025.03.19 16:10:50. First published: 2025.03.19 16:10:50.

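Background

To support disaster recovery and backup of LAS Catalog metadata, LAS Catalog provides a metadata export tool. By running a Spark job, you can export your LAS Catalog metadata to your own Hive Metastore (HMS) or to TOS. This document describes how to export metadata by running that Spark job.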

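Prerequisites

An EMR on ECS cluster has been created (it includes the Spark and Hive services by default).
Hive metadata is managed in an external database or the built-in database. Using LAS as the metadata store is not allowed.
The sync task uses the Metastore service of the EMR cluster and performs the export by running a Spark job on that same cluster.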

Procedure

Prepare the configuration file

mode: las_to_hive
lasClientInfo:
  akMode: AUTO
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
hmsClientInfo:
  hiveConfPath: /etc/emr/hive/conf/hive-site.xml
runOptions:
  includeCatalogPrefixs: [las_exporter]
  includeDatabasePrefixs: [las_exporter]
  includeTablePrefixs: [exporter_table_1000]
  batchSize: 1000
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
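
The sample above targets a semi-managed EMR cluster (akMode: AUTO plus hiveConfPath). As a minimal sketch of how the same export might be configured for a fully managed environment, the variant below uses only field names listed in the parameter table of this document. The nesting of accessKeyId and accessKeySecret under lasClientInfo follows the table's grouping, all values in angle brackets are placeholders, and the metastoreUris and jdbcUri values are the table's own examples rather than real endpoints.

```yaml
# Sketch only: las_to_hive against a fully managed HMS (assumed layout).
mode: las_to_hive
lasClientInfo:
  akMode: MANUAL                       # fully managed environments support only MANUAL
  accessKeyId: "<your-access-key-id>"          # placeholder
  accessKeySecret: "<your-access-key-secret>"  # placeholder
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
hmsClientInfo:
  metastoreUris: thrift://xxx:9083     # example value from the parameter table
  jdbcDriver: com.mysql.cj.jdbc.Driver # default value from the parameter table
  jdbcUri: "jdbc:mysql://mysqlcdxx.rds.ivolces.com:3306/hive_metastore_v3?characterEncoding=utf8"
  jdbcUserName: "<db-user>"            # placeholder
  jdbcPassword: "<db-password>"        # placeholder
runOptions:
  includeCatalogPrefixs: [las_exporter]
  includeDatabasePrefixs: [las_exporter]
  batchSize: 1000
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
```

Because this file contains credentials, it is worth restricting read access to wherever it is uploaded.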

Upload to HDFS

# Upload to the current directory
hadoop fs -put export_to_hive.yaml ./export_to_hive.yaml

Upload to TOS

# Upload to the tos://exporter_bucket/exporter_las directory
hadoop fs -put export_to_hive.yaml tos://exporter_bucket/exporter_las/export_to_hive.yaml

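Configuration parameters

Parameter	Field	Default	Description
mode (export mode)		las_to_tos	Valid values: las_to_tos (export from LAS to TOS); las_to_hive (export from LAS directly to Hive); tos_to_hive (import from TOS into Hive).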
lasClientInfo (LAS client configuration)	akMode	MANUAL	Valid values: MANUAL, AUTO. Note: AUTO is available only in semi-managed environments and does not require an AK/SK. Fully managed environments support only MANUAL.
	accessKeyId	-	Access key ID of the user accessing LAS metadata. May be left empty when akMode is AUTO.
	accessKeySecret	-	Access key secret of the user accessing LAS metadata. May be left empty when akMode is AUTO.
	endPoint	-	Valid values: Beijing: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869; Shanghai: thrift://cs.lakeformation.las.cn-shanghai.ivolces.com:48869; Guangzhou: thrift://cs.lakeformation.las.cn-guangzhou.ivolces.com:48869; Johor: thrift://cs.lakeformation.las.ap-southeast-1.ivolces.com:48869; Beijing self-driving zone: thrift://cs.lakeformation.las.cn-beijing-selfdrive.ivolces.com:48869
	regionId	cn-beijing	Valid values: cn-beijing, cn-shanghai, cn-guangzhou, ap-southeast-1, cn-beijing-selfdrive.
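hmsClientInfo (required in semi-managed environments)	hiveConfPath	/etc/emr/hive/conf/hive-site.xml	Semi-managed environments: path of the Hive configuration file.
	kerberosInfo.principal	-	Required when Kerberos is enabled in a semi-managed environment: the principal used to access HMS.
	kerberosInfo.keytab	-	Required when Kerberos is enabled in a semi-managed environment: path of the keytab used to access HMS.
hmsClientInfo (required in fully managed environments)	metastoreUris	-	URI of the target Hive Metastore to export to, for example thrift://xxx:9083
	jdbcDriver	com.mysql.cj.jdbc.Driver	JDBC driver class of the database used by the target Hive Metastore.
	jdbcUri	-	JDBC URI of the database used by the target Hive Metastore, for example jdbc:mysql://mysqlcdxx.rds.ivolces.com:3306/hive_metastore_v3?characterEncoding=utf8
	jdbcUserName	-	User name for the database used by the target Hive Metastore.
	jdbcPassword	-	Password for the database used by the target Hive Metastore.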
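runOptions (runtime options)	includeCatalogPrefixs	[hive]	Name prefix of the catalog to export. Because of runtime-environment constraints, only a single catalog can currently be imported at a time; importing multiple catalogs at once is not supported. Note: if the catalog name is not the default value hive, metastore.catalog.default must be set to that name in the user's HMS.
	includeDatabasePrefixs	-	Prefix match that filters the databases to export. Multiple prefixes can be set. If left empty, all databases are exported.
	includeTablePrefixs	-	Prefix match that filters the tables to export. Multiple prefixes can be set. If left empty, all tables are exported.
	includeFunctionPrefixs	-	Prefix match that filters the functions to export. Multiple prefixes can be set. If left empty, all functions are exported.
	objectTypes	-	Object types to export. Multiple types can be selected. Valid values: catalog, database, table, partition, function.
	batchSize	1000	Batch size for each read from LAS or write to Hive/TOS. The current maximum is 1000.
	sparkTaskBatchSize	200000	Number of metadata partitions handled by a single Spark task. When more than 200,000 partitions are exported, an additional Spark task is added to handle them.
	locationMappings (source, target)	-	When mode is tos_to_hive, location mappings can be configured (if omitted, LocationUri is kept unchanged). source is the LocationUri recorded in the metadata stored in TOS (not to be confused with the TOS address where the exported metadata itself is stored); target is the LocationUri to use in the Hive instance being imported into. Paths must end with a slash '/'. For example: source: tos://exporter-test/exporter_las/2025-02-18/ target: tos://exporter-test/metadata/warehouse/
	fixIfInConsistence	ignore	Consistency strategy when a database or table with the same name exists in both LAS and Hive. Valid values: ignore (skip; currently the only supported value); dropAndCreate (drop and recreate); update (update in place).
	outputBaseDir	-	Required when exporting to TOS, for example tos://exporter_bucket/exporter_las. On export, a partition folder named in yyyy-MM-dd format is created under this directory.
	inputBaseDir	-	Required when importing from TOS into Hive, for example tos://exporter_bucket/exporter_las/2025-01-01. It specifies which day's partition data to import from TOS.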
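
The las_to_tos and tos_to_hive modes can be combined into a two-stage backup and restore flow. The two configuration files below are a hedged sketch of that flow built only from the parameters in the table above. Several points are assumptions rather than documented behavior: that the client section not used by a mode can be omitted, that locationMappings is written as a list of source/target pairs, and that the dotted names kerberosInfo.principal and kerberosInfo.keytab map to a nested kerberosInfo block. Bucket names and dates are the table's example values.

```yaml
# Stage 1 (sketch): export LAS metadata to TOS.
mode: las_to_tos
lasClientInfo:
  akMode: AUTO            # AUTO is available only in semi-managed environments
  endPoint: thrift://cs.lakeformation.las.cn-beijing.ivolces.com:48869
  regionId: cn-beijing
runOptions:
  includeCatalogPrefixs: [las_exporter]
  batchSize: 1000
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
  # A yyyy-MM-dd partition folder is created under this directory on export.
  outputBaseDir: tos://exporter_bucket/exporter_las
```

The second file reads one day's export back from TOS and writes it into Hive:

```yaml
# Stage 2 (sketch): import a dated export from TOS into Hive.
mode: tos_to_hive
hmsClientInfo:
  hiveConfPath: /etc/emr/hive/conf/hive-site.xml
  # Only if Kerberos is enabled (nesting assumed from the dotted parameter names):
  # kerberosInfo:
  #   principal: "<hms-principal>"   # placeholder
  #   keytab: "<path-to-keytab>"     # placeholder
runOptions:
  includeCatalogPrefixs: [las_exporter]
  batchSize: 1000
  objectTypes:
    - catalog
    - database
    - table
    - partition
    - function
  # The dated folder produced by stage 1.
  inputBaseDir: tos://exporter_bucket/exporter_las/2025-01-01
  # Optional; list-of-pairs layout is an assumption. Paths end with '/'.
  locationMappings:
    - source: tos://exporter-test/exporter_las/2025-02-18/
      target: tos://exporter-test/metadata/warehouse/
```

If no mapping is needed, locationMappings can be left out, in which case the table states that LocationUri is kept unchanged.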

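Run the export Spark job

If the name of the catalog to export from LAS is not hive, set metastore.catalog.default=XX in the hive-site.xml file of the Hive service on the semi-managed EMR cluster, where XX is the name of the catalog to export. Restart HiveMetastore after changing the setting.
Log in to the master node of the EMR cluster.
Download the LAS metadata export JAR package from the following address: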

wget https://lasformation-cn-beijing.tos-cn-beijing.ivolces.com/las-exporter/application.jar

Upload the downloaded JAR package and the configuration file to your own HDFS or TOS.

Example: upload the downloaded JAR package to the tos://exporter_bucket/exporter_las directory.

hadoop fs -put application.jar tos://exporter_bucket/exporter_las/application.jar

Run the command

If the JAR package and the configuration file are both on HDFS:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 12G \
  --executor-memory 8G \
  --executor-cores 2 \
  --num-executors 5 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.adaptive.enabled=false \
  --class bytedance.olap.las.Exporter \
  ./application.jar ./export_to_hive.yaml

If the JAR package and the configuration file are both on TOS:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 12G \
  --executor-memory 8G \
  --executor-cores 2 \
  --num-executors 5 \
  --conf spark.sql.shuffle.partitions=200 \
  --conf spark.sql.adaptive.enabled=false \
  --class bytedance.olap.las.Exporter \
  tos://exporter_bucket/exporter_las/application.jar tos://exporter_bucket/exporter_las/export_to_hive.yaml

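The application.jar and configuration file paths at the end of each command must match the locations the files were uploaded to in the upload step above. If there are many partitions, increase num-executors accordingly, for example 100 executors for 1,000,000 partitions. This raises write parallelism and speeds up the write.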

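View the results

View the detailed execution logs of the export job in the Spark History Server UI, or query Hive directly from the command line to check whether the metadata was imported successfully.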