Introduction
Shared memory is used mainly for inter-process communication. Linux has two shared memory mechanisms:
(1) **System V shared memory (shmget/shmat/shmdt)**
Original shared memory mechanism, still widely used. Sharing between unrelated processes.
(2) **POSIX shared memory (shm_open/shm_unlink)**
Sharing between unrelated processes, without overhead of filesystem I/O. Intended to be simpler and better than older APIs.
Memory mappings also deserve a mention on Linux, since they can be used for inter-process communication as well:
**Shared mappings - mmap(2)**
- Shared anonymous mappings: Sharing between related processes only (related via fork())
- Shared file mappings: Sharing between unrelated processes, backed by file in filesystem
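As a minimal sketch of the first bullet, the function below creates a shared anonymous mapping, forks, and lets the child store a value that the parent then reads back. The name share_int_with_child is made up for this illustration, and error handling is kept to the bare minimum.

```c
#define _GNU_SOURCE            /* MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the value the child stored in the shared
 * page, or -1 on error. */
int share_int_with_child(int value)
{
    /* MAP_SHARED | MAP_ANONYMOUS: no backing file, but the page stays
     * shared between parent and child after fork(). */
    int *shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;
    *shared = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof(*shared));
        return -1;
    }
    if (pid == 0) {            /* child: write through the shared mapping */
        *shared = value;
        _exit(0);
    }

    int status;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = *shared;      /* parent observes the child's store */

    munmap(shared, sizeof(*shared));
    return result;
}
```

With MAP_PRIVATE instead of MAP_SHARED the parent would still read 0, because the child's store would go to its own copy-on-write page.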
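System V shared memory has a long history and is widely used; many Unix-like systems support it, and it is usually the one we reach for when writing programs. We will not discuss how to use these APIs here; for a detailed introduction to POSIX shared memory, see here (1) and here (2).
**With all that said, the question is: what does shared memory have to do with tmpfs?**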
The POSIX shared memory object implementation on Linux 2.4 makes use of a dedicated filesystem, which is normally mounted under /dev/shm.
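This shows that POSIX shared memory is implemented on top of tmpfs. In fact, going one step further, not only PSM (POSIX shared memory) but also SSM (System V shared memory) is implemented on top of tmpfs inside the kernel.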
About tmpfs
Here is how the kernel documentation describes tmpfs:
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at all. This is used for shared anonymous mappings and SYSV shared memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not set, the user visible part of tmpfs is not build. But the internal mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for POSIX shared memory (shm_open, shm_unlink). Adding the following line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on if necessary.
This mount is not needed for SYSV shared memory. The internal mount is used for that. (In the 2.3 kernel versions it was necessary to mount the predecessor of tmpfs (shm fs) to use SYSV shared memory)
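From this we can see that tmpfs serves two main purposes:
(1) It backs SYSV shared memory and shared anonymous mappings. This part is managed by the kernel and is invisible to users.
(2) It backs POSIX shared memory. This instance is mounted by the user, usually at /dev/shm, and it depends on CONFIG_TMPFS.
At this point we understand the difference between SSM and PSM, as well as the role of /dev/shm.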
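To check from a program whether a path such as /dev/shm really sits on a tmpfs instance, one option is to compare the filesystem type reported by statfs(2) with TMPFS_MAGIC. This is only a sketch: the helper name is_tmpfs is invented here, and it assumes the kernel headers (linux/magic.h) are installed.

```c
#include <linux/magic.h>   /* TMPFS_MAGIC */
#include <sys/vfs.h>       /* statfs() */

/* Hypothetical helper: returns 1 if path is on tmpfs, 0 if it is on some
 * other filesystem, and -1 if statfs() fails. */
int is_tmpfs(const char *path)
{
    struct statfs sfs;

    if (statfs(path, &sfs) != 0)
        return -1;
    return sfs.f_type == TMPFS_MAGIC ? 1 : 0;
}
```

On a typical Linux system is_tmpfs("/dev/shm") should return 1. The kernel-internal instance used for SYSV shared memory has no mount point, so it cannot be inspected this way.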
Let's run some tests.
Tests
We set the tmpfs at /dev/shm to 64M:
# mount -o remount,size=64M /dev/shm
# df -lh
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            64M     0   64M   0% /dev/shm
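The same numbers that df prints can be read from a program with statvfs(3). The sketch below is an illustration only; fs_capacity is a made-up helper name.

```c
#include <sys/statvfs.h>

/* Hypothetical helper: stores the total and available size (in bytes) of
 * the filesystem that holds path. Returns 0 on success, -1 on error. */
int fs_capacity(const char *path, unsigned long long *total,
                unsigned long long *avail)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return -1;
    /* f_blocks and f_bavail are counted in units of f_frsize bytes. */
    *total = (unsigned long long)vfs.f_blocks * vfs.f_frsize;
    *avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
    return 0;
}
```

Called on /dev/shm after the remount above, the total should come out as 67108864 bytes (64M), assuming the remount succeeded.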
The maximum size of a SYSV shared memory segment is 32M:
# cat /proc/sys/kernel/shmmax
33554432
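A program can read the same limit by parsing the /proc file directly. The helper below is a small sketch (read_ull_file is a made-up name) that reads one unsigned integer from any /proc-style file.

```c
#include <stdio.h>

/* Hypothetical helper: reads one unsigned decimal integer from path.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
int read_ull_file(const char *path, unsigned long long *out)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%llu", out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

For example, read_ull_file("/proc/sys/kernel/shmmax", &limit) would store 33554432 in limit on the machine used in this test.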
(1) Creating a 65M System V shared memory segment fails:
# ipcmk -M 68157440
ipcmk: create share memory failed: Invalid argument
This is expected.
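The "Invalid argument" message corresponds to shmget(2) failing with EINVAL. A rough sketch of the same check in C follows; try_sysv_segment is a made-up name, and the segment is created with IPC_PRIVATE so that it does not clash with any existing key.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: tries to create a private segment of `size` bytes.
 * Returns 0 on success (the segment is removed again right away),
 * otherwise the errno value reported by shmget(). */
int try_sysv_segment(size_t size)
{
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);

    if (id == -1)
        return errno;           /* EINVAL if size > shmmax or size < SHMMIN */
    shmctl(id, IPC_RMID, NULL); /* do not leave the segment behind */
    return 0;
}
```

With shmmax at 32M, try_sysv_segment(68157440) would be expected to return EINVAL, matching the ipcmk failure.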
(2) Raise shmmax to 65M
# echo 68157440 > /proc/sys/kernel/shmmax
# cat /proc/sys/kernel/shmmax
68157440
# ipcmk -M 68157440
Shared memory id: 0
# ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0xef46b249 0          root       644        68157440   0
As you can see, the size of System V shared memory is not affected by /dev/shm.
(3) Create POSIX shared memory
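/* gcc -o shmopen shmopen.c -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#define MAP_SIZE 68157440
int main(int argc, char *argv[])
{
    int fd;
    /* Create (or open) the object; it shows up as /dev/shm/shm1. */
    fd = shm_open("/shm1", O_RDWR|O_CREAT, 0644);
    if (fd < 0) {
        printf("shm_open failed\n");
        exit(1);
    }
    /* Set the object's size; no pages are allocated until they are touched. */
    if (ftruncate(fd, MAP_SIZE) < 0) {
        printf("ftruncate failed\n");
        exit(1);
    }
    return 0;
}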
# ./shmopen
# ls -lh /dev/shm/shm1
-rw-r--r-- 1 root root 65M Mar  3 06:19 /dev/shm/shm1
Although /dev/shm is only 64M, creating a 65M POSIX shared memory object still succeeds.
(4) Write data to the POSIX shared memory
/* gcc -o shmwrite shmwrite.c -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#define MAP_SIZE 68157440
int main(int argc, char *argv[])
{
    int fd;
    void *result;
    fd = shm_open("/shm1", O_RDWR|O_CREAT, 0644);
    if (fd < 0) {
        printf("shm_open failed\n");
        exit(1);
    }
    if (ftruncate(fd, MAP_SIZE) < 0) {
        printf("ftruncate failed\n");
        exit(1);
    }
    result = mmap(NULL, MAP_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (result == MAP_FAILED) {
        printf("mapped failed\n");
        exit(1);
    }
    /* ... operate result pointer */
    printf("memset\n");
    /* Touching every page forces tmpfs to allocate it, so the size limit
       of /dev/shm is hit here rather than in ftruncate() or mmap(). */
    memset(result, 0, MAP_SIZE);
    //shm_unlink("/shm1");
    return 0;
}
# ./shmwrite
memset
Bus error
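As you can see, writing 65M of data fails with a bus error.
Nevertheless, new files can still be created in /dev/shm:
# ls -lh /dev/shm/
total 64M
-rw-r--r-- 1 root root 65M Mar  3 15:23 shm1
-rw-r--r-- 1 root root 65M Mar  3 15:24 shm2
This is expected: ls shows the size recorded in the inode, not the space actually allocated.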
# stat /dev/shm/shm2
  File: "/dev/shm/shm2"
  Size: 68157440        Blocks: 0          IO Block: 4096   regular file
Device: 10h/16d Inode: 217177      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-03 15:24:28.025985167 +0800
Modify: 2015-03-03 15:24:28.025985167 +0800
Change: 2015-03-03 15:24:28.025985167 +0800
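The gap between "Size: 68157440" and "Blocks: 0" in the stat output can be reproduced with ftruncate(2) followed by fstat(2). The following is a sketch only; sparse_sizes is a made-up helper that works on any file descriptor, including one returned by shm_open.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grows fd to `size` bytes without writing any data,
 * then reports the apparent size and the bytes actually allocated.
 * Returns 0 on success, -1 on error. */
int sparse_sizes(int fd, off_t size, off_t *apparent, off_t *allocated)
{
    struct stat st;

    if (ftruncate(fd, size) != 0 || fstat(fd, &st) != 0)
        return -1;
    *apparent = st.st_size;                  /* what ls -l shows */
    *allocated = (off_t)st.st_blocks * 512;  /* st_blocks is in 512-byte units */
    return 0;
}
```

On tmpfs the allocated figure stays at 0 until pages are touched, which is why a 65M object can be created in a 64M /dev/shm.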
(5) Write data to the SYS V shared memory
Raise the System V shared memory maximum to 65M (/dev/shm stays at 64M).
# cat /proc/sys/kernel/shmmax
68157440
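/* gcc -o shmv shmv.c */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#define MAP_SIZE 68157440
int main(int argc, char **argv)
{
    int shm_id;
    key_t key;
    char *p_map;
    char *name = "/dev/shm/shm3";   /* must already exist for ftok() */
    key = ftok(name, 1);            /* proj_id must be nonzero */
    if (key == -1) {
        perror("ftok error");
        return 1;
    }
    /* Permission bits are included so the segment can be attached later. */
    shm_id = shmget(key, MAP_SIZE, IPC_CREAT | 0644);
    if (shm_id == -1) {
        perror("shmget error");
        return 1;
    }
    p_map = (char *)shmat(shm_id, NULL, 0);
    if (p_map == (char *)-1) {
        perror("shmat error");
        return 1;
    }
    /* Touch every page of the 65M segment. */
    memset(p_map, 0, MAP_SIZE);
    if (shmdt(p_map) == -1)
        perror("detach error");
    return 0;
}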
# ./shmv
Yet this runs without any problem.
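(6) Conclusion
Although System V and POSIX shared memory are both implemented on top of tmpfs, they are subject to different limits: /proc/sys/kernel/shmmax affects only SYS V shared memory, and the size of /dev/shm affects only POSIX shared memory. In fact, System V and POSIX shared memory use two different tmpfs instances to begin with.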
Kernel analysis
During initialization the kernel automatically mounts a tmpfs file system and keeps it as shm_mnt:
// mm/shmem.c
static struct file_system_type shmem_fs_type = {
    .owner = THIS_MODULE,
    .name = "tmpfs",
    .get_sb = shmem_get_sb,
    .kill_sb = kill_litter_super,
};
int __init shmem_init(void)
{
    ...
    error = register_filesystem(&shmem_fs_type);
    if (error) {
        printk(KERN_ERR "Could not register tmpfs\n");
        goto out2;
    }
    /* mount the internal tmpfs instance (used for SYS V) */
    shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER, shmem_fs_type.name, NULL);
    ...
}
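Mounting /dev/shm follows much the same path as mounting any ordinary file system, so we will not discuss it. It is worth noting, however, that /dev/shm defaults to half of the physical memory:
shmem_get_sb -> shmem_fill_super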
// mm/shmem.c
int shmem_fill_super(struct super_block *sb, void *data, int silent)
{
    ...
#ifdef CONFIG_TMPFS
    /*
     * Per default we only allow half of the physical ram per
     * tmpfs instance, limiting inodes to one per page of lowmem;
     * but the internal instance is left unlimited.
     */
    if (!(sb->s_flags & MS_NOUSER)) { /* the kernel-internal mount sets MS_NOUSER */
        sbinfo->max_blocks = shmem_default_max_blocks();
        sbinfo->max_inodes = shmem_default_max_inodes();
        if (shmem_parse_options(data, sbinfo, false)) {
            err = -EINVAL;
            goto failed;
        }
    }
    sb->s_export_op = &shmem_export_ops;
#else
    ...
#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
    return totalram_pages / 2;
}
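As we can see, because the kernel passes MS_NOUSER when it mounts its internal tmpfs, that instance has no size limit, so the memory available to SYS V shared memory is limited only by /proc/sys/kernel/shmmax. The user-mounted /dev/shm, by contrast, defaults to half of the physical memory.
Note the CONFIG_TMPFS guard.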
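The default limit can be estimated from user space as well. The sketch below mirrors the totalram_pages / 2 computation using sysconf(3). It is only an approximation: the helper names are made up, and _SC_PHYS_PAGES is a glibc extension whose value may differ slightly from the kernel's totalram_pages.

```c
#include <unistd.h>

/* Hypothetical helper mirroring shmem_default_max_blocks(): half of the
 * RAM pages (rounded down), converted to bytes. */
unsigned long long half_ram_bytes(unsigned long long pages,
                                  unsigned long long page_size)
{
    return (pages / 2) * page_size;
}

/* Estimate for the running machine; returns 0 if sysconf() cannot tell. */
unsigned long long default_tmpfs_limit(void)
{
    long pages = sysconf(_SC_PHYS_PAGES);
    long page_size = sysconf(_SC_PAGESIZE);

    if (pages <= 0 || page_size <= 0)
        return 0;
    return half_ram_bytes((unsigned long long)pages,
                          (unsigned long long)page_size);
}
```

For example, a machine with 1024 pages of 4096 bytes each (4M of RAM) gets a default limit of 512 pages, that is 2097152 bytes.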
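In addition, files in /dev/shm are created through the VFS interface, whereas SYS V shared memory and anonymous mappings are set up through shmem_file_setup.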
SIGBUS
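When an application touches an address in a shared memory mapping whose physical page has not been allocated yet, the fault method is called. If the allocation fails, it returns an OOM or SIGBUS error: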
static const struct vm_operations_struct shmem_vm_ops = {
    .fault = shmem_fault,
#ifdef CONFIG_NUMA
    .set_policy = shmem_set_policy,
    .get_policy = shmem_get_policy,
#endif
};
static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
    int error;
    int ret = VM_FAULT_LOCKED;
    error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
    if (error)
        return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
    return ret;
}
shmem_getpage -> shmem_getpage_gfp:
/*
 * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
 *
 * If we allocate a new one we do not mark it dirty. That's up to the
 * vm. If we swap it in we mark it dirty since we also free the swap
 * entry since a page cannot live in both the swap and page cache
 */
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
    struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
{
    ...
    if (sbinfo->max_blocks) { /* set for /dev/shm */
        if (percpu_counter_compare(&sbinfo->used_blocks, sbinfo->max_blocks) >= 0) {
            error = -ENOSPC;
            goto unacct;
        }
        percpu_counter_inc(&sbinfo->used_blocks);
    }
    /* allocate one physical page */
    page = shmem_alloc_page(gfp, info, index);
    if (!page) {
        error = -ENOMEM;
        goto decused;
    }
    SetPageSwapBacked(page);
    __set_page_locked(page);
    error = mem_cgroup_cache_charge(page, current->mm, gfp & GFP_RECLAIM_MASK); /* mem_cgroup check */
    if (!error)
        error = shmem_add_to_page_cache(page, mapping, index, gfp, NULL);
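The -ENOSPC branch above is what surfaced as the Bus error in test (4). One way an application might avoid the signal is to reserve the pages up front with posix_fallocate(3), so that a full tmpfs is reported as an ordinary error code before the memory is touched. This is a sketch under assumptions: map_reserved is a made-up helper, and it relies on the running kernel supporting fallocate on tmpfs (otherwise glibc falls back to writing into each block).

```c
#include <errno.h>
#include <fcntl.h>      /* posix_fallocate() */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical helper: reserves `size` bytes for fd and maps them shared.
 * Returns the mapping, or NULL with *err set to an errno value. */
void *map_reserved(int fd, size_t size, int *err)
{
    void *p;
    /* posix_fallocate() returns the error number directly, not via errno. */
    int rc = posix_fallocate(fd, 0, (off_t)size);

    if (rc != 0) {
        *err = rc;      /* e.g. ENOSPC when the tmpfs instance is full */
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        *err = errno;
        return NULL;
    }
    *err = 0;
    return p;
}
```

In the shmwrite test, replacing the ftruncate and mmap calls with map_reserved(fd, MAP_SIZE, &err) would be expected to fail with ENOSPC instead of crashing in memset.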
Shared memory and cgroups
Currently, shared memory pages are charged to the first group that touches them. See:
- http://lwn.net/Articles/516541/
- https://www.kernel.org/doc/Documentation/cgroups/memory.txt
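POSIX shared memory and Docker
Docker currently limits /dev/shm to 64M and offers no parameter to change it, which is a rather poor choice. An application that uses a large amount of POSIX shared memory is bound to run into problems. See:
- https://github.com/docker/docker/issues/2606
- https://github.com/docker/docker/pull/4981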
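Summary
(1) POSIX shared memory and SYS V shared memory are both implemented on top of tmpfs in the kernel, but they use two separate, independent tmpfs instances.
(2) /proc/sys/kernel/shmmax limits the maximum size of a single SYS V shared memory segment, while the size of /dev/shm limits the total size of all POSIX shared memory.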