OID CND Asia Slide: CurveFS
... decreased by 21% and TPS improved by 26% ... Performance comparison with Ceph. The test environment: ● A cluster consists of six nodes. Each node has 20 SSDs, 256 GB of memory, and 2 x E5-2660 CPUs ... Performance death 4s, slow response 1s frequently ... Data availability analysis: Copyset allocation algorithm; cluster with 1200 disks (each with an MTBF of 1.2 million hours), 3 replicas; restore data on a disk within 5 ... CURVE Community. You can take away: key designs used by CURVE. High availability and reliability ● Cluster topology ● CopySet pre-allocation algorithm ● Raft consistency protocol. High performance ● pre-created ...
24 pages | 3.47 MB | 6 months ago
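Curve File System Metadata Persistence Design
... abstracted as a KVStore that exposes SET/GET/DEL and similar interfaces; inodes and dentries are encoded and stored in the KVStore as key-value pairs. The current implementation can start with the KVStore alone (providing a convenient API), and Raft and the like can be integrated later (for now, persistence can be triggered when the KVStore exits, or run periodically) ... class KVStore : public braft::StateMachine ... approach no longer works. What are the high-availability and scalability options for redis? Mainly redis cluster + master-replica replication (or the third-party codis + sentinel). redis cluster/codis mainly address scalability: they shard the data, and each redis instance stores the keys of its shard. Master-replica replication mainly addresses high availability: each shard instance has 2 replicas attached, and when the master fails, cluster/sentinel automatically promotes a replica to master. redis + multi-raft ...
12 pages | 384.47 KB | 6 months ago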
Curve Cloud Native
... tuning ... Cloud native feature list • Features for Cluster • Features for CurveBS • Features for CurveFS • Features for monitor ... Feature list for cluster • Install / uninstall / upgrade and configure CurveBS/CurveFS through helm chart • upgrade automation • Supporting Curve Cluster provisioning in helm chart • metadata backup and recovery • MDS / ChunkServer should respect failure domains of cloud environments • Dashboard-driven configuration after minimal Curve install ... Feature list for cluster • CurveBS mirroring configured with CRDs • Different Curve clusters may share MDS and ETCD server
9 pages | 2.85 MB | 6 months ago
Estimation of Availability and Reliability in CurveBS
... annual probability of cluster data loss is as follows: 𝑃 = (𝑃! ∗ 𝑃) ∗ 𝑃-) The probability of no data loss in a cluster is as follows: 1 − 𝑃. Assume that in a CurveBS cluster consisting of 1200 ... minutes, i.e. 0.083 hours. Therefore, the annual probability of no data loss in the CurveBS cluster is P = 0.999999781
2 pages | 34.51 KB | 6 months ago
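CurveFS: Mapping Between Copysets and File Systems
... partition manages metadata, and the data partition manages data. The meta partition manages inode and dentry information. When a file system is created, how are the meta partitions initialized? master\cluster.go: a chubaofs file system is represented by a volume, and creating a file system creates 3 meta partitions and 10 data partitions. chubaofs's data p ... selected by usage, usually choosing the nodes with the lowest memory and disk usage, and then creating the corresponding meta partition on the corresponding meta node. How is a partition's host chosen? Through this function: func (c *Cluster) (excludeZone , excludeNodeSets [] , excludeHosts [] , replicaNum , crossZone , specifiedZone ... partition are organized using a struct called MetaWrapper: type MetaWrapper struct { sync.RWMutex cluster string localIP string volname string …… // Partitions and
19 pages | 383.29 KB | 6 months ago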
Curve for CNCF Main
... CSI plugin for CurveBS • Deploy CurveBS as container service (in Plan) • Config CurveBS by (Cluster and Pool CRDs) in Kubernetes (in Plan) • Support Operator capability level 5 (in Plan) • horizontal ... occasionally ... CAN SYNC WITH REMOTE DISK SERVER Y N ... I/O Jitter (vs. Ceph): 3 replicas on a 9-node cluster; each node has 20 x SSD, 2 x E5-2660 v4, and 256 GB memory. FAULTS CASE | CURVE I/O JITTER | CEPH I/O JITTER ... plugin for CurveFS (in Plan) • Deploy CurveFS as container service (in Plan) • Config CurveFS by (cluster and storage pools) CRDs in Kubernetes (in Plan) • Support Operator capability level 5 (in Plan)
21 pages | 4.56 MB | 6 months ago
CurveBS IO Processing Flow
... Server (MDS) • Manages and stores metadata information and persists the data in ETCD • Collects cluster status and schedules. 2. Chunkserver • Responsible for data storage • Multi-replica consistency ... 1. The Fs-meta cluster is used to manage the inode and dentry metadata of files. The architecture is like CurveBS, so metadata scalability is very good in this way. 2. The Fs-data cluster is used to store ...
13 pages | 2.03 MB | 6 months ago
Curve Core Components: MDS – NetEase Shufan
... on top of the physical pool, a logical pool is also created; the logical pool uses 3 zones, 3 replicas, and has 100 copysets. cluster pool1 zone1 zone2 zone3 server1 server2 server3 192.168.0.1:8200 192.168.0.2:8200 192.168.0.3:8200 cluster_map: servers: - name: server1 internalip: ... OPYSET Copyset generation strategy: Source code : curve/src/mds/copyset/ bool GenCopyset(const ClusterInfo& cluster, int numCopysets, std::vector* out); For example, to create 8 copysets across (zone1 zone2 zone3 zone4): result: ...
23 pages | 1.74 MB | 6 months ago
Curve Metadata Node High Availability
...
$ ETCDCTL_API=3 ./bin/etcdctl put foo bar
$ ETCDCTL_API=3 ./bin/etcdctl get foo --write-out=json
revision: 2
$ ETCDCTL_API=3 ./bin/etcdctl put foo bar
$ ETCDCTL_API=3 ./bin/etcdctl get ...
ETCDCTL_API=3 ./bin/etcdctl put hello world
$ ETCDCTL_API=3 ./bin/etcdctl get foo --write-out=json
revision: 4
$ ETCDCTL_API=3 ./bin/etcdctl get hello --write-out=json
revision: 4
$ ETCDCTL_API=3 ./bin/etcdctl ... /bin/etcdctl put hello world
$ ETCDCTL_API=3 ./bin/etcdctl get hello --write-out=json
revision: 5
3.2.2 The Campaign flow by example. Scenario: there are three MDS instances (mds1, mds2, mds3); the goal is for one MDS to serve as the leader while the other two act as standbys that take over when the leader fails. If the Campaign described above is used for the election, the process is as follows: ...
30 pages | 2.42 MB | 6 months ago
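The etcdctl commands in the snippet above show the reported revision moving from 2 to 4 to 5 as keys are written. A minimal sketch of that behavior, assuming a single store-wide counter that starts at 1 and is bumped by every write: the class `RevisionedStore` and its fields are hypothetical names for illustration, not etcd's API, and the intermediate value 3 is what this toy model produces rather than something shown in the snippet.

```python
class RevisionedStore:
    """Toy in-memory model of a key-value store with one global revision."""

    def __init__(self):
        # Assumption: a fresh store starts at revision 1.
        self.revision = 1
        # key -> (value, revision at which the key was last written)
        self.data = {}

    def put(self, key, value):
        # Every write bumps the store-wide revision, even when the
        # value written is identical to the one already stored.
        self.revision += 1
        self.data[key] = (value, self.revision)
        return self.revision

    def get(self, key):
        # Reads never change the revision; they report the current
        # store-wide revision alongside the key's own mod_revision.
        value, mod_revision = self.data[key]
        return {
            "value": value,
            "mod_revision": mod_revision,
            "revision": self.revision,
        }


store = RevisionedStore()
r1 = store.put("foo", "bar")      # first write
r2 = store.put("foo", "bar")      # same value, still a new revision
r3 = store.put("hello", "world")  # a different key shares the counter
foo = store.get("foo")
hello = store.get("hello")
r4 = store.put("hello", "world")

print(r1, r2, r3, foo["revision"], hello["revision"], r4)
```

In this model, reading `foo` and `hello` after the third write reports the same store-wide revision for both keys, which is the pattern in the snippet: the revision belongs to the store, not to an individual key.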
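NJSD eBPF Technical Document - 0924 Version
Curve distributed file storage • High performance, easy to operate, cloud native ... Curve file system architecture and main use cases • AI and machine learning • Big data computing • Middleware data storage • POSIX-compatible file API • Low-latency file data access ... Problems facing the Curve file system • User-space implementation • High stability/reliability • Easy to update and maintain • POSIX-compatible file interface provided via FUSE • Problem • The bottleneck is /dev/fuse communication overhead ... Possible FUSE-based optimizations • Reduce communication latency between the kernel and libfuse • Have the kernel return directly for operations on file attributes? • Read and write the kernel cache first for operations on file data? ... Ways to implement a POSIX-compatible API, and their problems • FUSE-based implementations • curve / ceph / gluster • Overriding file system calls with LD_PRELOAD • vpp / f-stack / DirectFUSE ... google android12 passthrough ... What is eBPF • eBPF is a tool for configuring, debugging, and monitoring the kernel in different environments • Maps • Verifier • Hooks • Helper API ... Configuring the TCP initial RTO • Scenario: before kernel 4.12, the initial RTO was a constant 1s • Program type: BPF_PROG_TYPE_SOCK_OPS • Hook: BPF_SOCK_OPS_TIMEOUT_INIT
20 pages | 7.40 MB | 6 months ago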