pagecache data corrupted

pagecache页面数据被写坏

Posted by icecube on April 9, 2024

简介

产品反馈虚拟机环境,内存配置为8G,升级版本时提示数据校验不过,然后升级失败。

4G内存配置无该问题。host是鲲鹏服务器,kunpeng-920处理器。

同样的虚拟机配置使用其它ubuntu或者centos的系统镜像,虚拟机都没有出现该问题。

定位过程

尝试在虚机设备上升级失败后计算版本文件的MD5值,每次都会变化。
drop_cache之后计算版本文件md5值是正确的,后续计算的MD5值都是错的,而且一直在变化。

每次MD5计算出错后,cp拷贝多份版本文件导出,
和正常版本文件对比发现,固定偏移处总有两处几个字节的变化,不像是单个bit位跳变,没有规律。

flash上的版本文件是没有改动的。因为清缓存后重新计算版本文件的MD5值是正确的。
既然没有造成page dirty,同一个页面也不太可能分配给两个不同的进程,
所以不倾向于pagecache机制问题,不倾向于内存分配问题。

测试偶然发现和配置telnet连接有关,不配置telnet,串口执行md5 ipe不会变化;
开启关闭一个telnet口,md5 ipe值也可能会变化。
telnet端口没有命令交互的话,几十秒md5值不会变化,但是之后会变,猜测可能和网络保活有关。

虚机镜像系统内核不支持硬件断点,启用了kasan的版本也没跑出来什么东西。

查找修改页面

先看看pagecache里的数据到底哪里被改了

写了个测试模块,遍历文件在pagecache中的所有的page,计算page页面数据的MD5值。
然后输出pfn、virtual address、physical address,以及md5值到trace缓冲区。

每次在MD5值出现变化后执行insmod ko 然后rmmod ko,拿到多次打印记录做对比。
两个page的MD5值一直在变,其它page数据MD5都是一样的。

打印页面数据

再写一个测试模块,根据指定pfn号,打印出这两个页面数据,然后多次数据记录做对比。
确实这两个page里固定偏移处的一两个字节在变。

下图中页面数据基本为0,只有固定偏移0x6a6处连续两个字节非0,
是因为系统刚起来,只打开一个telnet口,pagecahe里并没有多少东西。 可见telnet嫌疑之重。

设置页面写保护

之前遇到过只读页面写入数据触发保护异常的情况,
既然出问题的pfn固定,就想着能不能利用这种方式抓下,看看能不能抓得到。

写一个测试模块,根据pfn号找到对应的pte表项,设置为只读。
一旦telnet命令交互产生网络数据,如果写到pagecache的这两个页面,那理应抓得到。

加载两次ko模块 insmod pgrdonly.ko pfn=2238333insmod pgrdonly.ko pfn=2238378,打印如下

<3>[  765.828961] [pg_init 152] pc, init_mm addr 0xffff800028df1ae8
<3>[  765.829906] [set_page_rdonly 104] pc, pfn 2238333
<3>[  765.830630] [set_page_rdonly 111] pc, page 0xfffffc00094c5710
<3>[  765.831526] [set_page_rdonly 114] pc, page address 0xffff0001e277d000
<3>[  765.832539] [get_pte 42] pc, init_mm 0xffff800028df1ae8, page addr 0xffff0001e277d000
<3>[  765.833791] [ffff0001e277d000] pgd=000000023fff8003
<3>[  765.834489] pud=000000023f044003
<3>[  765.834991] pmd=000000023ef30003
<3>[  765.835457] pte=006800022277d713
<3>[  765.835993] [set_page_rdonly 117] pc, ptep 0xffff0001fef30be8
<3>[  765.836878] [set_pte_rdonly 84] pc, pte value 0x6800022277d713 before set rdonly
<3>[  765.838016] [set_pte_rdonly 86] pc, pte value 0xe000022277d793 construct wrprotect
<3>[  765.839166] [set_pte_rdonly 88] pc, pte value 0xe000022277d793 after set
<3>[  781.939330] Exiting set page readlony module

<3>[  784.383368] [pg_init 152] pc, init_mm addr 0xffff800028df1ae8
<3>[  784.384348] [set_page_rdonly 104] pc, pfn 2238378
<3>[  784.385135] [set_page_rdonly 111] pc, page 0xfffffc00094c6520
<3>[  784.386075] [set_page_rdonly 114] pc, page address 0xffff0001e27aa000
<3>[  784.387110] [get_pte 42] pc, init_mm 0xffff800028df1ae8, page addr 0xffff0001e27aa000
<3>[  784.388377] [ffff0001e27aa000] pgd=000000023fff8003
<3>[  784.389155] pud=000000023f044003
<3>[  784.389657] pmd=000000023ef30003
<3>[  784.390169] pte=00680002227aa713
<3>[  784.390676] [set_page_rdonly 117] pc, ptep 0xffff0001fef30d50
<3>[  784.391588] [set_pte_rdonly 84] pc, pte value 0x680002227aa713 before set rdonly
<3>[  784.392805] [set_pte_rdonly 86] pc, pte value 0xe00002227aa793 construct wrprotect
<3>[  784.394074] [set_pte_rdonly 88] pc, pte value 0xe00002227aa793 after set
<3>[  787.310472] Exiting set page readlony module

然后telnet端口随便执行一些命令,系统随即触发了保护异常。
0xffff0001e27aa6a6就位于pfn = 2238378的页面内部,
0x6a6的页内偏移也和页面数据对比中的偏移一致,可以看到是模块wan协议栈代码写越界。

<1>[  796.021377] Unable to handle kernel write to read-only memory at virtual address ffff0001e27aa6a6
<1>[  796.022942] Mem abort info:
<1>[  796.023373]   ESR = 0x9600004f
<1>[  796.023912]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[  796.024764]   SET = 0, FnV = 0
<1>[  796.025232]   EA = 0, S1PTW = 0
<1>[  796.025737] Data abort info:
<1>[  796.026185]   ISV = 0, ISS = 0x0000004f
<1>[  796.026812]   CM = 0, WnR = 1
<1>[  796.027278] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000040a66000
<1>[  796.028376] [ffff0001e27aa6a6] pgd=000000023fff8003, pud=000000023f044003, pmd=000000023ef30003, pte=00e00002227aa793
<0>[  796.030145] Internal error: Oops: 9600004f [#1] SMP
<4>[  796.030936] Modules linked in: wan(PO) vsr(O) [last unloaded: pgrdonly]
<4>[  796.032370] CPU: 1 PID: 1993 Comm: kdrvfwdd1 Tainted: P           O      5.4.90 #1
<4>[  796.033595] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
<4>[  796.034707] pstate: 60400005 (nZCv daif +PAN -UAO)
<4>[  796.037144] pc : ip_newid+0x4c/0x70 [wan]
<4>[  796.039317] lr : ip_newid+0x1c/0x70 [wan]
<4>[  796.040064] sp : ffff00019f6ce420
<4>[  796.040677] x29: ffff00019f6ce420 x28: 0000000000000000
<4>[  796.041619] x27: ffff000194cf43c8 x26: ffff800028963348
<4>[  796.042483] x25: ffff800008beae2c x24: ffff0001c0bb82d0
<4>[  796.043351] x23: ffff0001c00ef8f0 x22: ffff800028ea0858
<4>[  796.044209] x21: ffff000194238000 x20: ffff000194cf4380
<4>[  796.045114] x19: 0000000000000048 x18: ffff800028a04b18
<4>[  796.045984] x17: 0000000000000001 x16: ffff800028770420
<4>[  796.046886] x15: 0000000010000000 x14: ffff800029973000
<4>[  796.047754] x13: 0000000000000001 x12: 00000000ffffffff
<4>[  796.048645] x11: ffff800041f01000 x10: 0000000000000a00
<4>[  796.049515] x9 : ffff80002819c458 x8 : ffff00019f6ce4d0
<4>[  796.050367] x7 : 0000000000000000 x6 : 0000000000000001
<4>[  796.051256] x5 : 0000000000000000 x4 : ffff000194238000
<4>[  796.052143] x3 : ffff800028d66000 x2 : 0000000000000005
<4>[  796.053014] x1 : 000000000000f07e x0 : ffff0001e27aa6a6
<4>[  796.053891] Call trace:
<4>[  796.055566]  ip_newid+0x4c/0x70 [wan]
<4>[  796.057519]  tcp_output+0x26c4/0x2d00 [wan]
<4>[  796.059676]  tcp_do_segment+0xfd0/0x43e0 [wan]
<4>[  796.061788]  tcp_input+0x277c/0x2ba0 [wan]
<4>[  796.063818]  ip_protoinput+0xa8/0xd0 [wan]
...... // 省去部分业务栈
<4>[  796.099035]  kthread+0x12c/0x130
<4>[  796.099546]  ret_from_fork+0x10/0x18
<0>[  796.100167] Code: 79400000 0b000020 12003c01 f9400fe0 (79000001)
<4>[  796.101154] task_switch hooks:
[1]kdb>

模块代码

md5sum pagecache pages

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/namei.h>
#include <linux/fs_struct.h>
#include <linux/mount.h>
#include <linux/path.h>
#include <linux/crypto.h>
#include <crypto/hash.h>
#include <linux/hash.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm_types.h>


#define MD5_DIGEST_LENGTH 16

static void calculate_md5(const char *data, size_t len, char *digest)
{
	struct crypto_shash *tfm;
	struct shash_desc *desc;
	char hex_digest[MD5_DIGEST_LENGTH * 2 + 1] = {0}; // 十六进制表示MD5值
	char *hash;
	int ret;

	tfm = crypto_alloc_shash("md5", 0, 0);
	if (IS_ERR(tfm)) {
		pr_err("Failed to allocate transform for MD5\n");
		return;
	}

	desc = kmalloc(sizeof(struct shash_desc) + crypto_shash_descsize(tfm),
			GFP_KERNEL);
	if (!desc) {
		pr_err("Failed to allocate shash descriptor for MD5\n");
		crypto_free_shash(tfm);
		return;
	}

	desc->tfm = tfm;

	ret = crypto_shash_init(desc);
	if (ret) {
		pr_err("Failed to initialize MD5 hash\n");
		kfree(desc);
		crypto_free_shash(tfm);
		return;
	}

	ret = crypto_shash_update(desc, data, len);
	if (ret) {
		pr_err("Failed to update MD5 hash\n");
		kfree(desc);
		crypto_free_shash(tfm);
		return;
	}

	hash = kmalloc(MD5_DIGEST_LENGTH, GFP_KERNEL);
	if (!hash) {
		pr_err("Failed to allocate memory for MD5 hash\n");
		kfree(desc);
		crypto_free_shash(tfm);
		return;
	}

	ret = crypto_shash_final(desc, hash);
	if (ret) {
		pr_err("Failed to finalize MD5 hash\n");
		kfree(hash);
		kfree(desc);
		crypto_free_shash(tfm);
		return;
	}
	bin2hex(hex_digest, hash, MD5_DIGEST_LENGTH);
	memcpy(digest, hex_digest, MD5_DIGEST_LENGTH * 2 + 1);

	kfree(hash);
	kfree(desc);
	crypto_free_shash(tfm);
}

static int print_page_md5(const char *filename)
{
	struct file *file;
	struct path path;
	struct inode *inode;
	struct page *page;
	unsigned long index;
	unsigned long phys_addr;
	unsigned long pfn;
	char *data;
	char digest[MD5_DIGEST_LENGTH * 2 + 1] = {0}; // 十六进制表示MD5值

	if (kern_path(filename, LOOKUP_FOLLOW, &path)) {
		printk(KERN_ERR "Failed to get path for file: %s\n", filename);
		return -ENOENT;
	}

	file = filp_open(filename, O_RDONLY, 0);
	if (IS_ERR(file)) {
		printk(KERN_ERR "Failed to open file: %s\n", filename);
		return PTR_ERR(file);
	}

	inode = file_inode(file);

	index = 0;
	while ((page = find_get_page(inode->i_mapping, index))) {
		pfn = page_to_pfn(page);
		phys_addr = page_to_phys(page);

		data = kmap(page);
		if (!data) {
			printk(KERN_ERR "Failed to map page data\n");
			put_page(page);
			continue;
		}

		calculate_md5(data, PAGE_SIZE, digest);
		trace_printk("%lu: pfn %lu, va 0x%lx, pa 0x%lx, md5 %s\n", index, pfn, (unsigned long)data, phys_addr, digest);

		kunmap(page);
		put_page(page);
		index++;

		cond_resched();
	}

	filp_close(file, NULL);
	return 0;
}

static int __init pg_init(void)
{
	const char *filename = "/mnt/flash/ipe";
	pr_err("[%s %d] pc, filename %s\n", __func__, __LINE__, filename);
	print_page_md5(filename);
	return 0;
}

static void __exit pg_exit(void)
{
	printk(KERN_INFO "Exiting pagecache_md5 module\n");
}

module_init(pg_init);
module_exit(pg_exit);

dump page content

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/namei.h>
#include <linux/fs_struct.h>
#include <linux/mount.h>
#include <linux/path.h>
#include <linux/crypto.h>
#include <crypto/hash.h>
#include <linux/hash.h>
#include <asm/delay.h>
#include <linux/delay.h>
#include <linux/printk.h>

unsigned char cont[PAGE_SIZE] = {0};

typedef unsigned long ulong;
unsigned long pfn;
module_param(pfn, ulong, S_IRUGO);

static void print_page(unsigned long pfn)
{
	struct page *page;
	unsigned char *data;
	unsigned long phys_addr;

	page = pfn_to_page(pfn);
	phys_addr = page_to_phys(page);

	data = kmap(page);
	if (!data) {
		printk(KERN_ERR "Failed to map page data\n");
		put_page(page);
		return;
	}
	memcpy(cont, data, PAGE_SIZE);

	pr_err("pfn %lu, va 0x%lx, pa 0x%lx\n", pfn, (unsigned long)data, phys_addr);
	pr_err("------------------------------------------\n");
	print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1, cont, PAGE_SIZE, true);
	pr_err("------------------------------------------\n");

	kunmap(page);
	put_page(page);
}

static int __init pc_init(void)
{
	pr_err("[%s %d] pc, hello print page content pfn 2238334,2238397,2238000\n", __func__, __LINE__);
	print_page(2238334);
	print_page(2238397);
	print_page(2238000);
	return 0;
}

static void __exit pc_exit(void)
{
	printk(KERN_INFO "Exiting print page content module\n");
}

module_init(pc_init);
module_exit(pc_exit);

set page read-only

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/namei.h>
#include <linux/fs_struct.h>
#include <linux/mount.h>
#include <linux/path.h>
#include <linux/crypto.h>
#include <crypto/hash.h>
#include <linux/hash.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/pfn.h>
#include <linux/slab.h>
#include <linux/kallsyms.h>
#include <asm/pgtable.h>

#include <asm/page.h>
#include <asm/pgalloc.h>

typedef unsigned long ulong;

static unsigned long pfn;

module_param(pfn, ulong, S_IRUGO);

struct mm_struct *pmm = NULL;

static pte_t *get_pte(unsigned long addr)
{
	struct mm_struct *mm = pmm;
	pgd_t *pgdp, pgd;
	pud_t *pudp, pud;
	pmd_t *pmdp, pmd;
	pte_t *ptep = NULL, pte;

	pr_err("[%s %d] pc, init_mm 0x%lx, page addr 0x%lx\n", __func__, __LINE__, (unsigned long)pmm, addr);

	pgdp = pgd_offset(mm, addr);
	pgd = READ_ONCE(*pgdp);
	pr_err("[%016lx] pgd=%016llx\n", addr, pgd_val(pgd));

	if (pgd_none(pgd) || pgd_bad(pgd)) {
		pr_err("[%s %d] pc, pgd invalid\n", __func__, __LINE__);
		return 0;
	}

	pudp = pud_offset(pgdp, addr);
	pud = READ_ONCE(*pudp);
	pr_err("pud=%016llx\n", pud_val(pud));
	if (pud_none(pud) || pud_bad(pud)) {
		pr_err("[%s %d] pc, pud invalid\n", __func__, __LINE__);
		return 0;
	}

	pmdp = pmd_offset(pudp, addr);
	pmd = READ_ONCE(*pmdp);
	pr_err("pmd=%016llx\n", pmd_val(pmd));
	if (pmd_none(pmd) || pmd_bad(pmd)) {
		pr_err("[%s %d] pc, pmd invalid\n", __func__, __LINE__);
		return 0;
	}

	ptep = pte_offset_map(pmdp, addr);
	pte = READ_ONCE(*ptep);
	if (!pte_valid(READ_ONCE(*ptep))) {
		pr_err("[%s %d] pc, pte invalid\n", __func__, __LINE__);
		return 0;
	}
	pr_err("pte=%016llx\n", pte_val(pte));

	return ptep;
}

static void set_pte_rdonly(pte_t *ptep, unsigned long address)
{
	pte_t pte;
	pte = READ_ONCE(*ptep);
	pr_err("[%s %d] pc, pte value 0x%lx before set rdonly\n", __func__, __LINE__, (unsigned long)pte_val(pte));
	pte = pte_wrprotect(pte);
	pr_err("[%s %d] pc, pte value 0x%lx construct wrprotect\n", __func__, __LINE__, (unsigned long)pte_val(pte));
	set_pte_at(pmm, address, ptep, pte);
	pr_err("[%s %d] pc, pte value 0x%lx after set\n", __func__, __LINE__, (unsigned long)pte_val(READ_ONCE(*ptep)));
	flush_tlb_all();
}

static void write_rdonly_page(unsigned long addr)
{
	memset((void*)addr, 0x0, PAGE_SIZE);
}

static void set_page_rdonly(unsigned long pfn)
{
	struct page *page = NULL;
	unsigned long address;
	pte_t *ptep;

	pr_err("[%s %d] pc, pfn %lu\n", __func__, __LINE__, pfn);
	page = pfn_to_page(pfn);
	if (!page) {
		pr_err("[%s %d] pc, get pfn %lu page failure\n", __func__, __LINE__, pfn);
		return;
	}

	pr_err("[%s %d] pc, page 0x%lx\n", __func__, __LINE__, (unsigned long)page);

	address = (unsigned long)page_address(page);
	pr_err("[%s %d] pc, page address 0x%lx\n", __func__, __LINE__, address);

	ptep = get_pte(address);
	pr_err("[%s %d] pc, ptep 0x%lx\n", __func__, __LINE__, (unsigned long)ptep);

	set_pte_rdonly(ptep, address);
}

static void set_alloc_page_rdonly(void)
{
	struct page *page;
	unsigned long address;
	pte_t *ptep;

	page = alloc_page(GFP_KERNEL);
	if (!page) {
		pr_err("[%s %d] pc, alloc_page failure\n", __func__, __LINE__);
		return;
	}

	pr_err("[%s %d] pc, page 0x%lx\n", __func__, __LINE__, (unsigned long)page);

	address = (unsigned long)page_address(page);
	pr_err("[%s %d] pc, page address 0x%lx\n", __func__, __LINE__, address);
	ptep = get_pte(address);

	pr_err("[%s %d] pc, ptep 0x%lx\n", __func__, __LINE__, (unsigned long)ptep);
	set_pte_rdonly(ptep, address);
}

static int __init pg_init(void)
{
	pmm = (struct mm_struct *)kallsyms_lookup_name("init_mm");
	if (!pmm) {
		pr_err("[%s %d] pc, find init_mm failure\n", __func__, __LINE__);
		return -1;
	}

	pr_err("[%s %d] pc, init_mm addr 0x%lx\n", __func__, __LINE__, (unsigned long)pmm);
	set_page_rdonly(pfn);
	/*set_alloc_page_rdonly();*/
	return 0;
}

static void __exit pg_exit(void)
{
	pr_err("Exiting set page readlony module\n");
}

module_init(pg_init);
module_exit(pg_exit);

参考

linux-5.4.90