How Ceph RBD Volumes Work in Docker and Kubernetes: An Analysis
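Does using a Ceph RBD block device inside Docker or Kubernetes cost extra performance compared with using it directly on the host? With that question in mind, this article analyzes how the underlying mechanisms work.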
Mount Propagation in Linux
Mount Relationships in Linux
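The Linux mount namespace isolates filesystem mount points, so each namespace has its own view of the filesystem. It was the first namespace added to Linux, which is why its flag has the generic name CLONE_NEWNS. Once isolated, changes to the mount layout in one mount namespace do not affect the others. You can list every filesystem mounted in a process's namespace through /proc/[pid]/mounts, and /proc/[pid]/mountstats gives statistics about the mounts in that namespace, including the mounted device name, the filesystem type, and the mount point.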
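When a process creates a mount namespace, the current mount tree is copied into the new namespace. From then on, mount operations in the new namespace affect only its own view and have no effect outside it. This isolation is strict, but it does not suit every case. For example, if a process in the parent namespace mounts a CD-ROM, the copy of the tree held by the child namespace does not pick up that mount automatically, because mount events do not cross the namespace boundary.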
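Mount propagation, introduced in 2006, solves this problem. It defines relationships between mount objects, and the kernel uses those relationships to decide how a mount event on one mount object propagates to the others (see http://www.ibm.com/developerworks/library/l-mount-namespaces/). A propagation event is a mount or unmount on one mount object that is triggered by a state change on another mount object.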
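A mount can be in one of the following states:
A mount object that propagates events is a shared mount. A mount object that receives propagated events is a slave mount. A mount object that neither propagates nor receives events is a private mount. The last kind is the unbindable mount: it behaves like a private mount, but it additionally cannot be used as the source of a bind mount, so it is skipped when a directory tree is bound recursively.
The use case for shared mounts is obvious: they are required whenever mounted data has to be shared. Slave mounts are mostly useful in "read-only" scenarios, where mount events should flow in one direction only. Private mounts are pure isolation, each one standing on its own. Unbindable mounts help avoid unnecessary replication: when the root directory is bound recursively, a user data directory needs a way to opt out of being replicated, whether for privacy or for practical reasons.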
$ mount --make-shared <mount-object>
A mount object cloned from a shared mount is also a shared mount; the two propagate mount events to each other.
$ mount --make-slave <shared-mount-object>
A mount object cloned from a slave mount is also a slave mount, and it is a slave of the same master as the original slave mount.
$ mount --make-shared <slave-mount-object>
$ mount --make-private <mount-object>
$ mount --make-unbindable <mount-object>
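The state set by these commands can be read back from the optional fields of /proc/[pid]/mountinfo. The following is a minimal sketch, assuming the field layout described in proc(5): a `shared:N` tag marks a shared mount, `master:N` marks a slave, `unbindable` marks an unbindable mount, and no tag means private. The function name `propagation_of` is made up for illustration; the sample lines are copied from outputs that appear in this article.

```python
def propagation_of(mountinfo_line):
    """Classify one /proc/[pid]/mountinfo line by its optional fields."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed (IDs, device, root, mount point, options).
    # Optional fields follow, terminated by a lone "-" separator.
    sep = fields.index("-", 6)
    tags = {f.split(":", 1)[0] for f in fields[6:sep]}
    if "unbindable" in tags:
        return "unbindable"
    if "shared" in tags and "master" in tags:
        # Result of `mount --make-shared` on a slave mount.
        return "shared+slave"
    if "shared" in tags:
        return "shared"
    if "master" in tags:
        return "slave"
    return "private"


SHARED = ("549 40 253:0 /opt/tmp /mnt/tmp rw,relatime shared:1 - xfs "
          "/dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota")
SLAVE = ("1029 1011 253:0 /var/lib/docker/volumes/x/_data /data1 rw,relatime "
         "master:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota")
PRIVATE = ("1029 1011 253:0 /opt /data2 rw,relatime - xfs "
           "/dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota")

for sample in (SHARED, SLAVE, PRIVATE):
    print(sample.split()[4], "->", propagation_of(sample))
```

To inspect a live system, the same function can be applied to each line of /proc/self/mountinfo.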
Testing Bind Mounts on Linux
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 222.6G 0 disk
├─sda1 8:1 0 200M 0 part /boot
└─sda2 8:2 0 222.4G 0 part
├─centos-root 253:0 0 122.4G 0 lvm /
└─centos-home 253:1 0 100G 0 lvm /home
$ mkdir /opt/tmp /mnt/tmp /mnt/tmp1 /mnt/tmp2
$ mount --bind /opt/tmp /mnt/tmp
$ mount --bind /mnt/tmp1 /mnt/tmp2
$ cat /proc/self/mountinfo | grep /mnt/tmp
549 40 253:0 /opt/tmp /mnt/tmp rw,relatime shared:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
583 40 253:0 /mnt/tmp1 /mnt/tmp2 rw,relatime shared:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
Both bind-mounted directories are shared and belong to the same peer group (shared:1), and their parent mount (ID 40) is on device 253:0.
Main Ways to Use Data Volumes in Docker
References:
Manage data in Docker;
Use bind mounts.
$ docker run --rm -it -v /data1 centos:7 bash
# Or
$ docker run --rm -it -v data1:/data1 centos:7 bash
# Or
$ docker run --rm -it --mount target=/data1 centos:7 bash
# Or
$ docker run --rm -it --mount type=volume,target=/data1 centos:7 bash
# Or
$ docker run --rm -it --mount type=volume,source=data1,target=/data1 centos:7 bash
$ docker ps | awk 'NR==2 {print $1}' | xargs -i docker inspect -f '{{.State.Pid}}' {} | xargs -i cat /proc/{}/mountinfo | grep data
1029 1011 253:0 /var/lib/docker/volumes/239be79a64f7fa6ec815b1d9f2a7773a678ee5c8c1150f03ca81b0d5177b36a0/_data /data1 rw,relatime master:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
$ docker run --rm -it -v /opt:/data2 centos:7 bash
# Or
$ docker run --rm -it --mount type=bind,source=/opt,target=/data2 centos:7 bash
$ docker ps | awk 'NR==2 {print $1}' | xargs -i docker inspect -f '{{.State.Pid}}' {} | xargs -i cat /proc/{}/mountinfo | grep data
1029 1011 253:0 /opt /data2 rw,relatime - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
$ docker create --name vc -v /data1 centos:7
$ docker run --rm -it --volumes-from vc centos:7 bash
$ docker ps | awk 'NR==2 {print $1}' | xargs -i docker inspect -f '{{.State.Pid}}' {} | xargs -i cat /proc/{}/mountinfo | grep data
1029 1011 253:0 /var/lib/docker/volumes/fe71f2d0ef18beb92cab7b99afcc5f501e47ed18224463e8c1aa1e8733003803/_data /data1 rw,relatime master:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
Edit the Dockerfile
FROM busybox:latest
ADD htdocs /usr/local/apache2/htdocs
VOLUME /usr/local/apache2/htdocs
Create the container
$ mkdir htdocs
$ echo `date` > htdocs/test.txt
$ docker build -t volume-test .
$ docker create --name vc2 -v /data1 volume-test
$ docker run --rm -it --volumes-from vc2 volume-test sh
/ # cat /proc/self/mountinfo | grep htdocs
1034 1011 253:0 /var/lib/docker/volumes/54f47af60b8fb25602f022dcd8ad5b3e1a93a2d20c1045184a70391d9bed69b6/_data /usr/local/apache2/htdocs rw,relatime master:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
$ docker ps | awk 'NR==2 {print $1}' | xargs -i docker inspect -f '{{.State.Pid}}' {} | xargs -i cat /proc/{}/mountinfo | grep htdocs
1034 1011 253:0 /var/lib/docker/volumes/54f47af60b8fb25602f022dcd8ad5b3e1a93a2d20c1045184a70391d9bed69b6/_data /usr/local/apache2/htdocs rw,relatime master:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
$ docker run --rm -it --mount type=tmpfs,target=/data1 centos:7 bash
$ docker ps | awk 'NR==2 {print $1}' | xargs -i docker inspect -f '{{.State.Pid}}' {} | xargs -i cat /proc/{}/mountinfo | grep data
1029 1011 0:160 / /data1 rw,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw,seclabel
Testing Block Devices in Docker
$ docker run --rm -it -v /data1 -v /opt:/data2 centos:7 bash
[root@4282b3df2417 /]# mount | grep data
/dev/sdb1 on /data2 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/sda1 on /data1 type xfs (rw,relatime,attr2,inode64,noquota)
$ docker inspect 4282b3df2417 | grep -i pid
"Pid": 12797,
"PidMode": "",
"PidsLimit": 0,
$ cat /proc/12797/mounts | grep data
/dev/sdb1 /data2 xfs rw,relatime,attr2,inode64,noquota 0 0
/dev/sda1 /data1 xfs rw,relatime,attr2,inode64,noquota 0 0
$ docker run --rm -it --device /dev/sdc:/dev/sdc centos:7 bash
[root@55423f5eaeea /]# mkfs -t minix /dev/sdc
21856 inodes
65535 blocks
Firstdatazone=696 (696)
Zonesize=1024
Maxsize=268966912
[root@55423f5eaeea /]# mknod /dev/sdd b 8 48
[root@55423f5eaeea /]# mkfs -t minix /dev/sdd
mkfs.minix: cannot open /dev/sdd: Operation not permitted
[root@55423f5eaeea /]# rm /dev/sdc
rm: remove block special file '/dev/sdc'? y
[root@55423f5eaeea /]# mknod /dev/sdc b 8 32
[root@55423f5eaeea /]# mkfs -t minix /dev/sdc
21856 inodes
65535 blocks
Firstdatazone=696 (696)
Zonesize=1024
Maxsize=268966912
[root@55423f5eaeea /]# mount /dev/sdc mnt/
mount: permission denied
[root@55423f5eaeea /]# dd if=/dev/sdc of=/dev/null bs=512 count=10
10+0 records in
10+0 records out
5120 bytes (5.1 kB) copied, 0.000664491 s, 7.7 MB/s
[root@55423f5eaeea /]# dd if=/dev/zero of=/dev/sdc bs=512 count=10
10+0 records in
10+0 records out
5120 bytes (5.1 kB) copied, 0.00124138 s, 4.1 MB/s
$ docker run --rm -it --privileged=true centos:7 bash
[root@b5c40e199476 /]# mount /dev/sdc mnt
[root@b5c40e199476 /]# mkfs -t minix /dev/sdc
mount: unknown filesystem type 'minix'
[root@b5c40e199476 /]# yum install -y xfsprogs
[root@b5c40e199476 /]# mkfs.xfs /dev/sdc -f
meta-data=/dev/sdc isize=512 agcount=4, agsize=6553600 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=26214400, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=12800, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
[root@b5c40e199476 /]# mount /dev/sdc mnt
[root@b5c40e199476 /]# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 30G 19G 12G 62% /
tmpfs 910M 0 910M 0% /dev
tmpfs 910M 0 910M 0% /sys/fs/cgroup
/dev/sda1 30G 19G 12G 62% /etc/hosts
shm 64M 0 64M 0% /dev/shm
/dev/sdc 100G 33M 100G 1% /mnt
[root@b5c40e199476 /]# echo `date` > /mnt/time.txt
[root@b5c40e199476 /]# cat /mnt/time.txt
Wed Mar 6 12:23:05 UTC 2019
Block Device Usage and Implementation in Kubernetes
// pkg/kubelet/kubelet.go
// setupDataDirs creates:
// 1. the root directory
// 2. the pods directory
// 3. the plugins directory
// 4. the pod-resources directory
func (kl *Kubelet) setupDataDirs() error {
	...
	if err := kl.mounter.MakeRShared(kl.getRootDir()); err != nil {
		return fmt.Errorf("error configuring root directory: %v", err)
	}
	...
}
// pkg/util/mount/nsenter_mount.go
func (n *NsenterMounter) MakeRShared(path string) error {
	return doMakeRShared(path, hostProcMountinfoPath)
}
// pkg/util/mount/mount_linux.go
// doMakeRShared is common implementation of MakeRShared on Linux. It checks if
// path is shared and bind-mounts it as rshared if needed. mountCmd and
// mountArgs are expected to contain mount-like command, doMakeRShared will add
// '--bind <path> <path>' and '--make-rshared <path>' to mountArgs.
func doMakeRShared(path string, mountInfoFilename string) error {
	shared, err := isShared(path, mountInfoFilename)
	if err != nil {
		return err
	}
	if shared {
		klog.V(4).Infof("Directory %s is already on a shared mount", path)
		return nil
	}
	klog.V(2).Infof("Bind-mounting %q with shared mount propagation", path)
	// mount --bind /var/lib/kubelet /var/lib/kubelet
	if err := syscall.Mount(path, path, "" /*fstype*/, syscall.MS_BIND, "" /*data*/); err != nil {
		return fmt.Errorf("failed to bind-mount %s: %v", path, err)
	}
	// mount --make-rshared /var/lib/kubelet
	if err := syscall.Mount(path, path, "" /*fstype*/, syscall.MS_SHARED|syscall.MS_REC, "" /*data*/); err != nil {
		return fmt.Errorf("failed to make %s rshared: %v", path, err)
	}
	return nil
}
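The excerpt above calls isShared without showing its body. The sketch below illustrates one plausible way such a check can work, and is not the Kubernetes implementation: it looks up the mount whose mount point is the longest prefix of the path and tests whether that mount carries a `shared:N` tag. The helper names `find_mount` and `is_shared` and the sample text are assumptions made for illustration.

```python
def find_mount(path, mountinfo_text):
    """Return the mountinfo fields of the mount that contains `path`.

    The containing mount is the one whose mount point is the longest
    prefix of `path` (compared on whole path components).
    """
    best = None
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        mount_point = fields[4]
        contains = (path == mount_point
                    or path.startswith(mount_point.rstrip("/") + "/"))
        if contains and (best is None or len(mount_point) > len(best[4])):
            best = fields
    return best


def is_shared(path, mountinfo_text):
    """True if the mount containing `path` has a shared:N optional field."""
    fields = find_mount(path, mountinfo_text)
    if fields is None:
        raise ValueError("no mount found for %s" % path)
    sep = fields.index("-", 6)  # optional fields end at the lone "-"
    return any(f.startswith("shared:") for f in fields[6:sep])


SAMPLE = """\
40 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
1029 1011 253:0 /opt /data2 rw,relatime - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
"""

print(is_shared("/var/lib/kubelet", SAMPLE))  # falls under "/", which is shared
print(is_shared("/data2/file", SAMPLE))       # falls under "/data2", which is private
```

When the check returns false, the Go code above bind-mounts the directory onto itself and then marks it rshared, so that volumes later mounted under the kubelet root directory can propagate.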
$ echo 'apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
spec:
  containers:
  - name: nginx
    image: nginx:latest
    volumeMounts:
    - name: nginx-test-vol1
      mountPath: /data/
      readOnly: false
  volumes:
  - name: nginx-test-vol1
    persistentVolumeClaim:
      claimName: nginx-test-vol1-claim' | kubectl create -f -
pod/nginx-test created
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
nginx-test-vol1-claim Bound pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54 10Gi RWO ceph-rbd 114s
$ kubectl describe pvc nginx-test-vol1-claim
Name: nginx-test-vol1-claim
Namespace: default
StorageClass: ceph-rbd
Status: Bound
Volume: pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54
Labels: <none>
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 10Gi
Access Modes: RWO
VolumeMode: Filesystem
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ProvisioningSucceeded 6m36s persistentvolume-controller Successfully provisioned volume pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54 using kubernetes.io/rbd
Mounted By: nginx-test
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54 10Gi RWO Delete Bound default/nginx-test-vol1-claim ceph-rbd 105s
$ kubectl describe pv pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54
Name: pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54
Labels: <none>
Annotations: kubernetes.io/createdby: rbd-dynamic-provisioner
pv.kubernetes.io/bound-by-controller: yes
pv.kubernetes.io/provisioned-by: kubernetes.io/rbd
Finalizers: [kubernetes.io/pv-protection]
StorageClass: ceph-rbd
Status: Bound
Claim: default/nginx-test-vol1-claim
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 10Gi
Node Affinity: <none>
Message:
Source:
Type: RBD (a Rados Block Device mount on the host that shares a pod's lifetime)
CephMonitors: [172.29.201.125:6789 172.29.201.126:6789 172.29.201.201:6789]
RBDImage: kubernetes-dynamic-pvc-db7fcd29-446c-11e9-af81-6c92bf74be54
FSType:
RBDPool: k8s
RadosUser: k8s
Keyring: /etc/ceph/keyring
SecretRef: &SecretReference{Name:ceph-k8s-secret,Namespace:,}
ReadOnly: false
Events: <none>
$ rbd ls -p k8s
kubernetes-dynamic-pvc-db7fcd29-446c-11e9-af81-6c92bf74be54
$ lsblk | grep rbd0
rbd0 252:0 0 10G 0 disk /var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~rbd/pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54
$ cat /proc/self/mountinfo | grep rbd0
313 40 252:0 / /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/k8s-image-kubernetes-dynamic-pvc-db7fcd29-446c-11e9-af81-6c92bf74be54 rw,relatime shared:262 - ext4 /dev/rbd0 rw,seclabel,stripe=1024,data=ordered
318 40 252:0 / /var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~rbd/pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54 rw,relatime shared:262 - ext4 /dev/rbd0 rw,seclabel,stripe=1024,data=ordered
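The RBD device is mounted in two places: the Pod's volume directory and the RBD plugin directory. Both mounts are tagged shared:262, which shows that the two directories are bound to each other.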
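The same observation can be made programmatically by grouping mount points by the device behind them. This is a small sketch, assuming the mountinfo layout from proc(5), where the third field is the major:minor device number and the mount source follows the filesystem type after the "-" separator. The function name `mount_points_by_device` is hypothetical; the sample is the output shown above.

```python
from collections import defaultdict


def mount_points_by_device(mountinfo_text):
    """Map (major:minor, mount source) to the list of its mount points."""
    groups = defaultdict(list)
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        sep = fields.index("-", 6)
        source = fields[sep + 2]  # after "-": fstype, source, super options
        groups[(fields[2], source)].append(fields[4])
    return dict(groups)


RBD_MOUNTS = """\
313 40 252:0 / /var/lib/kubelet/plugins/kubernetes.io/rbd/mounts/k8s-image-kubernetes-dynamic-pvc-db7fcd29-446c-11e9-af81-6c92bf74be54 rw,relatime shared:262 - ext4 /dev/rbd0 rw,seclabel,stripe=1024,data=ordered
318 40 252:0 / /var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~rbd/pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54 rw,relatime shared:262 - ext4 /dev/rbd0 rw,seclabel,stripe=1024,data=ordered
"""

for (dev, source), points in mount_points_by_device(RBD_MOUNTS).items():
    print(dev, source)
    for point in points:
        print("   ", point)
```

Two mount points under one key mean a single filesystem on /dev/rbd0 is visible at both paths, with no copy of the data in between.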
$ cat /proc/self/mountinfo | grep "^40 "
40 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/centos-root rw,seclabel,attr2,inode64,noquota
The parent of both RBD mount points (mount ID 40) is the host's root filesystem, which is on device 253:0.
$ cat /proc/self/mountinfo | grep 18a8fb7b-446d-11e9-bbd8-6c92bf74be54
303 40 0:56 / /var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~secret/default-token-zn95h rw,relatime shared:233 - tmpfs tmpfs rw,seclabel
318 40 252:0 / /var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~rbd/pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54 rw,relatime shared:262 - ext4 /dev/rbd0 rw,seclabel,stripe=1024,data=ordered
$ cat /proc/self/mountinfo | grep shared:233
303 40 0:56 / /var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~secret/default-token-zn95h rw,relatime shared:233 - tmpfs tmpfs rw,seclabel
The Pod has two volumes mounted: the RBD volume seen earlier and a volume that holds the Secret.
$ docker inspect $(docker ps | grep nginx_nginx-test | awk '{print $1}') | grep Mounts -A33
"Mounts": [
{
"Type": "bind",
"Source": "/var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~rbd/pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54",
"Destination": "/data",
"Mode": "Z",
"RW": true,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~secret/default-token-zn95h",
"Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
"Mode": "ro,Z",
"RW": false,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/etc-hosts",
"Destination": "/etc/hosts",
"Mode": "Z",
"RW": true,
"Propagation": "rprivate"
},
{
"Type": "bind",
"Source": "/var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/containers/nginx/190cc168",
"Destination": "/dev/termination-log",
"Mode": "Z",
"RW": true,
"Propagation": "rprivate"
}
],
All of these Docker volumes end up as bind mounts, and their mount propagation is set to rprivate.
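The relevant fields can also be pulled out of the `docker inspect` JSON instead of grepping for them. The sketch below assumes the usual shape of that output, a JSON array of container objects each with a top-level "Mounts" list; `summarize_mounts` is a made-up helper name, and the sample is a trimmed copy of the output above.

```python
import json


def summarize_mounts(inspect_json):
    """Return (destination, type, propagation) for every container mount."""
    rows = []
    for container in json.loads(inspect_json):
        for mount in container.get("Mounts", []):
            rows.append((mount["Destination"],
                         mount["Type"],
                         mount.get("Propagation", "")))
    return rows


# Trimmed sample: only two of the four mounts, with the fields used here.
SAMPLE_INSPECT = """
[{"Mounts": [
  {"Type": "bind",
   "Source": "/var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/volumes/kubernetes.io~rbd/pvc-d6f6b6f8-446c-11e9-bbd8-6c92bf74be54",
   "Destination": "/data", "Mode": "Z", "RW": true, "Propagation": "rprivate"},
  {"Type": "bind",
   "Source": "/var/lib/kubelet/pods/18a8fb7b-446d-11e9-bbd8-6c92bf74be54/etc-hosts",
   "Destination": "/etc/hosts", "Mode": "Z", "RW": true, "Propagation": "rprivate"}
]}]
"""

for destination, kind, propagation in summarize_mounts(SAMPLE_INSPECT):
    print(destination, kind, propagation)
```

On a real node, the input would be the full output of `docker inspect <container-id>`.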
$ docker exec -it $(docker ps | grep nginx_nginx-test | awk '{print $1}') df -h
Filesystem Size Used Avail Use% Mounted on
overlay 123G 4.7G 118G 4% /
tmpfs 64M 0 64M 0% /dev
tmpfs 189G 0 189G 0% /sys/fs/cgroup
/dev/rbd0 9.8G 37M 9.7G 1% /data
/dev/mapper/centos-root 123G 4.7G 118G 4% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 189G 12K 189G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 189G 0 189G 0% /proc/acpi
tmpfs 189G 0 189G 0% /proc/scsi
tmpfs 189G 0 189G 0% /sys/firmware
$ docker exec -it $(docker ps | grep nginx_nginx-test | awk '{print $1}') cat /proc/self/mountinfo | grep -e rbd -e serviceaccount
617 599 252:0 / /data rw,relatime - ext4 /dev/rbd0 rw,seclabel,stripe=1024,data=ordered
623 599 0:56 / /run/secrets/kubernetes.io/serviceaccount ro,relatime - tmpfs tmpfs rw,seclabel
Inside the Pod's container, the two main mounts are indeed the RBD volume and the Secret directory.
Summary
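In Docker, whichever way a data volume backed by host storage is used (a volume, a bind mount, or --volumes-from), it is ultimately implemented with Linux bind mounts (mount --bind). A tmpfs mount is the exception, since it is a separate in-memory filesystem.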
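When an RBD volume is used in Kubernetes, the image is first mapped to the host with rbd map and formatted if needed, then mounted on a host directory, and finally that host directory is bind-mounted (mount --bind) to the target directory in the container.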
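The sequence described above can be written out as a dry run. This is only a conceptual sketch: it builds and prints the equivalent manual commands and executes nothing, and it does not claim to match the exact commands kubelet or the container runtime issue. The function name, its parameters, and the short image name and paths in the example are all hypothetical; the pool, user, and device values come from the outputs earlier in this article.

```python
import shlex


def rbd_volume_steps(pool, image, user, device, host_dir, container_dir,
                     fstype="ext4"):
    """Build the conceptual command sequence for exposing an RBD image."""
    return [
        # 1. Map the RBD image to a block device on the host.
        ["rbd", "map", "%s/%s" % (pool, image), "--id", user],
        # 2. Create a filesystem (only needed the first time).
        ["mkfs", "-t", fstype, device],
        # 3. Mount the block device on a host directory.
        ["mount", "-t", fstype, device, host_dir],
        # 4. Bind-mount the host directory to the container's target path
        #    (in practice the runtime does this in the container's namespace).
        ["mount", "--bind", host_dir, container_dir],
    ]


steps = rbd_volume_steps(
    pool="k8s", image="example-image", user="k8s", device="/dev/rbd0",
    host_dir="/var/lib/kubelet/example-volume", container_dir="/data",
)
for command in steps:
    print(" ".join(shlex.quote(arg) for arg in command))
```

Only the first step involves Ceph. Everything after it is ordinary mounting, and I/O from the container goes to the same /dev/rbd0 device that the host uses.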
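From this analysis we can tentatively conclude that RBD read/write performance measured on the host is essentially the same as performance measured inside Docker or Kubernetes; Docker and Kubernetes themselves do not affect RBD performance. (A full benchmark with fio, run afterwards, was consistent with this conclusion.)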