Author: Zhen Ni <[email protected]> Date: Wed Mar 2 15:42:41 2022 +0800 ALSA: intel_hdmi: Fix reference to PCM buffer address commit 0aa6b294b312d9710804679abd2c0c8ca52cc2bc upstream. PCM buffers might be allocated dynamically when the buffer preallocation failed or a larger buffer is requested, and it's not guaranteed that substream->dma_buffer points to the actually used buffer. The driver needs to refer to substream->runtime->dma_addr instead for the buffer address. Signed-off-by: Zhen Ni <[email protected]> Cc: <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Takashi Iwai <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Adam Ford <[email protected]> Date: Tue Jan 25 11:11:25 2022 -0600 arm64: dts: imx8mm: Fix VPU Hanging [ Upstream commit ef3075d6638d3d5353a97fcc7bb0338fc85675f5 ] The vpumix power domain has a reset assigned to it, however when used, it causes a system hang. Testing has shown that it does not appear to be needed anywhere. Fixes: d39d4bb15310 ("arm64: dts: imx8mm: add GPC node") Signed-off-by: Adam Ford <[email protected]> Reviewed-by: Lucas Stach <[email protected]> Signed-off-by: Shawn Guo <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Robin Murphy <[email protected]> Date: Mon Jan 24 17:57:01 2022 +0000 arm64: dts: juno: Remove GICv2m dma-range [ Upstream commit 31eeb6b09f4053f32a30ce9fbcdfca31f713028d ] Although it is painstakingly honest to describe all 3 PCI windows in "dma-ranges", it misses the the subtle distinction that the window for the GICv2m range is normally programmed for Device memory attributes rather than Normal Cacheable like the DRAM windows. Since MMU-401 only offers stage 2 translation, this means that when the PCI SMMU is enabled, accesses through that IPA range unexpectedly lose coherency if mapped as cacheable at the SMMU, due to the attribute combining rules. Since an extra 256KB is neither here nor there when we still have 10GB worth of usable address space, rather than attempting to describe and cope with this detail let's just remove the offending range. If the SMMU is not used then it makes no difference anyway. Link: https://lore.kernel.org/r/856c3f7192c6c3ce545ba67462f2ce9c86ed6b0c.1643046936.git.robin.murphy@arm.com Fixes: 4ac4d146cb63 ("arm64: dts: juno: Describe PCI dma-ranges") Reported-by: Anders Roxell <[email protected]> Acked-by: Liviu Dudau <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Signed-off-by: Sudeep Holla <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Frank Wunderlich <[email protected]> Date: Sun Jan 23 14:35:10 2022 +0100 arm64: dts: rockchip: drop pclk_xpcs from gmac0 on rk3568 [ Upstream commit 85a8bccfa945680dc561f06b65ea01341d2033fc ] pclk_xpcs is not supported by mainline driver and breaks dtbs_check following warnings occour, and many more rk3568-evb1-v10.dt.yaml: ethernet@fe2a0000: clocks: [[15, 386], [15, 389], [15, 389], [15, 184], [15, 180], [15, 181], [15, 389], [15, 185], [15, 172]] is too long From schema: Documentation/devicetree/bindings/net/snps,dwmac.yaml rk3568-evb1-v10.dt.yaml: ethernet@fe2a0000: clock-names: ['stmmaceth', 'mac_clk_rx', 'mac_clk_tx', 'clk_mac_refout', 'aclk_mac', 'pclk_mac', 'clk_mac_speed', 'ptp_ref', 'pclk_xpcs'] is too long From schema: Documentation/devicetree/bindings/net/snps,dwmac.yaml after removing it, the clock and other warnings are gone. pclk_xpcs on gmac is used to support QSGMII, but this requires a driver supporting it. Once xpcs support is introduced, the clock can be added to the documentation and both controllers. Fixes: b8d41e5053cd ("arm64: dts: rockchip: add gmac0 node to rk3568") Co-developed-by: Peter Geis <[email protected]> Signed-off-by: Peter Geis <[email protected]> Signed-off-by: Frank Wunderlich <[email protected]> Acked-by: Michael Riesch <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Heiko Stuebner <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Peter Geis <[email protected]> Date: Thu Jan 27 19:38:05 2022 -0500 arm64: dts: rockchip: fix Quartz64-A ddr regulator voltage [ Upstream commit ad02776cf8d083e28b1ca4d93d8b1949668c27cc ] The Quartz64 Model A uses a voltage divider to ensure ddr voltage is within specification from the default regulator configuration. Adjusting this voltage is detrimental, and currently causes the ddr voltage to be about 0.8v. Remove the min and max voltage setpoints for the ddr regulator. Fixes: b33a22a1e7c4 ("arm64: dts: rockchip: add basic dts for Pine64 Quartz64-A") Signed-off-by: Peter Geis <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Heiko Stuebner <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Brian Norris <[email protected]> Date: Fri Jan 14 15:02:07 2022 -0800 arm64: dts: rockchip: Switch RK3399-Gru DP to SPDIF output commit b5fbaf7d779f5f02b7f75b080e7707222573be2a upstream. Commit b18c6c3c7768 ("ASoC: rockchip: cdn-dp sound output use spdif") switched the platform to SPDIF, but we didn't fix up the device tree. Drop the pinctrl settings, because the 'spdif_bus' pins are either: * unused (on kevin, bob), so the settings is ~harmless * used by a different function (on scarlet), which causes probe failures (!!) Fixes: b18c6c3c7768 ("ASoC: rockchip: cdn-dp sound output use spdif") Signed-off-by: Brian Norris <[email protected]> Reviewed-by: Chen-Yu Tsai <[email protected]> Link: https://lore.kernel.org/r/20220114150129.v2.1.I46f64b00508d9dff34abe1c3e8d2defdab4ea1e5@changeid Signed-off-by: Heiko Stuebner <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Masami Hiramatsu <[email protected]> Date: Mon Jan 24 17:17:54 2022 +0900 arm64: Mark start_backtrace() notrace and NOKPROBE_SYMBOL [ Upstream commit 1e0924bd09916fab795fc2a21ec1d148f24299fd ] Mark the start_backtrace() as notrace and NOKPROBE_SYMBOL because this function is called from ftrace and lockdep to get the caller address via return_address(). The lockdep is used in kprobes, it should also be NOKPROBE_SYMBOL. Fixes: b07f3499661c ("arm64: stacktrace: Move start_backtrace() out of the header") Cc: <[email protected]> # 5.13.x Signed-off-by: Masami Hiramatsu <[email protected]> Reviewed-by: Mark Brown <[email protected]> Link: https://lore.kernel.org/r/164301227374.1433152.12808232644267107415.stgit@devnote2 Signed-off-by: Catalin Marinas <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Randy Dunlap <[email protected]> Date: Wed Feb 23 20:46:35 2022 +0100 ARM: 9182/1: mmu: fix returns from early_param() and __setup() functions commit 7b83299e5b9385943a857d59e15cba270df20d7e upstream. early_param() handlers should return 0 on success. __setup() handlers should return 1 on success, i.e., the parameter has been handled. A return of 0 would cause the "option=value" string to be added to init's environment strings, polluting it. ../arch/arm/mm/mmu.c: In function 'test_early_cachepolicy': ../arch/arm/mm/mmu.c:215:1: error: no return statement in function returning non-void [-Werror=return-type] ../arch/arm/mm/mmu.c: In function 'test_noalign_setup': ../arch/arm/mm/mmu.c:221:1: error: no return statement in function returning non-void [-Werror=return-type] Fixes: b849a60e0903 ("ARM: make cr_alignment read-only #ifndef CONFIG_CPU_CP15") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: Igor Zhbanov <[email protected]> Cc: Uwe Kleine-König <[email protected]> Cc: [email protected] Cc: [email protected] Signed-off-by: Russell King (Oracle) <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Anthoine Bourgeois <[email protected]> Date: Tue Jan 25 20:11:38 2022 +0100 ARM: dts: switch timer config to common devkit8000 devicetree [ Upstream commit 64324ef337d0caa5798fa8fa3f6bbfbd3245868a ] This patch allow lcd43 and lcd70 flavors to benefit from timer evolution. Fixes: e428e250fde6 ("ARM: dts: Configure system timers for omap3") Signed-off-by: Anthoine Bourgeois <[email protected]> Signed-off-by: Tony Lindgren <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Anthoine Bourgeois <[email protected]> Date: Tue Jan 25 20:11:39 2022 +0100 ARM: dts: Use 32KiHz oscillator on devkit8000 [ Upstream commit 8840f5460a23759403f1f2860429dcbcc2f04a65 ] Devkit8000 board seems to always used 32k_counter as clocksource. Restore this behavior. If clocksource is back to 32k_counter, timer12 is now the clockevent source (as before) and timer2 is not longer needed here. This commit fixes the same issue observed with commit 23885389dbbb ("ARM: dts: Fix timer regression for beagleboard revision c") when sleep is blocked until hitting keys over serial console. Fixes: aba1ad05da08 ("clocksource/drivers/timer-ti-dm: Add clockevent and clocksource support") Fixes: e428e250fde6 ("ARM: dts: Configure system timers for omap3") Signed-off-by: Anthoine Bourgeois <[email protected]> Signed-off-by: Tony Lindgren <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Russell King (Oracle) <[email protected]> Date: Wed Feb 16 15:37:38 2022 +0000 ARM: Fix kgdb breakpoint for Thumb2 commit d920eaa4c4559f59be7b4c2d26fa0a2e1aaa3da9 upstream. The kgdb code needs to register an undef hook for the Thumb UDF instruction that will fault in order to be functional on Thumb2 platforms. Reported-by: Johannes Stezenbach <[email protected]> Tested-by: Johannes Stezenbach <[email protected]> Fixes: 5cbad0ebf45c ("kgdb: support for ARCH=arm") Signed-off-by: Russell King (Oracle) <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Thierry Reding <[email protected]> Date: Mon Dec 20 11:32:39 2021 +0100 ARM: tegra: Move panels to AUX bus [ Upstream commit 8d3b01e0d4bb54368d73d0984466d72c2eeeac74 ] Move the eDP panel on Venice 2 and Nyan boards into the corresponding AUX bus device tree node. This allows us to avoid a nasty circular dependency that would otherwise be created between the DPAUX and panel nodes via the DDC/I2C phandle. Fixes: eb481f9ac95c ("ARM: tegra: add Acer Chromebook 13 device tree") Fixes: 59fe02cb079f ("ARM: tegra: Add DTS for the nyan-blaze board") Fixes: 40e231c770a4 ("ARM: tegra: Enable eDP for Venice2") Signed-off-by: Thierry Reding <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Fabio Estevam <[email protected]> Date: Tue Feb 15 09:05:14 2022 -0300 ASoC: cs4265: Fix the duplicated control name commit c5487b9cdea5c1ede38a7ec94db0fc59963c8e86 upstream. Currently, the following error messages are seen during boot: asoc-simple-card sound: control 2:0:0:SPDIF Switch:0 is already present cs4265 1-004f: ASoC: failed to add widget SPDIF dapm kcontrol SPDIF Switch: -16 Quoting Mark Brown: "The driver is just plain buggy, it defines both a regular SPIDF Switch control and a SND_SOC_DAPM_SWITCH() called SPDIF both of which will create an identically named control, it can never have loaded without error. One or both of those has to be renamed or they need to be merged into one thing." Fix the duplicated control name by combining the two SPDIF controls here and move the register bits onto the DAPM widget and have DAPM control them. Fixes: f853d6b3ba34 ("ASoC: cs4265: Add a S/PDIF enable switch") Signed-off-by: Fabio Estevam <[email protected]> Acked-by: Charles Keepax <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Marek Vasut <[email protected]> Date: Tue Feb 15 14:06:45 2022 +0100 ASoC: ops: Shift tested values in snd_soc_put_volsw() by +min commit 9bdd10d57a8807dba0003af0325191f3cec0f11c upstream. While the $val/$val2 values passed in from userspace are always >= 0 integers, the limits of the control can be signed integers and the $min can be non-zero and less than zero. To correctly validate $val/$val2 against platform_max, add the $min offset to val first. Fixes: 817f7c9335ec0 ("ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()") Signed-off-by: Marek Vasut <[email protected]> Cc: Mark Brown <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Kai Vehmanen <[email protected]> Date: Mon Feb 7 17:29:59 2022 +0200 ASoC: rt5668: do not block workqueue if card is unbound [ Upstream commit a6d78661dc903d90a327892bbc34268f3a5f4b9c ] The current rt5668_jack_detect_handler() assumes the component and card will always show up and implements an infinite usleep loop waiting for them to show up. This does not hold true if a codec interrupt (or other event) occurs when the card is unbound. The codec driver's remove or shutdown functions cannot cancel the workqueue due to the wait loop. As a result, code can either end up blocking the workqueue, or hit a kernel oops when the card is freed. Fix the issue by rescheduling the jack detect handler in case the card is not ready. In case card never shows up, the shutdown/remove/suspend calls can now cancel the detect task. Signed-off-by: Kai Vehmanen <[email protected]> Reviewed-by: Bard Liao <[email protected]> Reviewed-by: Ranjani Sridharan <[email protected]> Reviewed-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Reviewed-by: Shuming Fan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Kai Vehmanen <[email protected]> Date: Mon Feb 7 17:30:00 2022 +0200 ASoC: rt5682: do not block workqueue if card is unbound [ Upstream commit 4c33de0673ced9c7c37b3bbd9bfe0fda72340b2a ] The current rt5682_jack_detect_handler() assumes the component and card will always show up and implements an infinite usleep loop waiting for them to show up. This does not hold true if a codec interrupt (or other event) occurs when the card is unbound. The codec driver's remove or shutdown functions cannot cancel the workqueue due to the wait loop. As a result, code can either end up blocking the workqueue, or hit a kernel oops when the card is freed. Fix the issue by rescheduling the jack detect handler in case the card is not ready. In case card never shows up, the shutdown/remove/suspend calls can now cancel the detect task. Signed-off-by: Kai Vehmanen <[email protected]> Reviewed-by: Bard Liao <[email protected]> Reviewed-by: Ranjani Sridharan <[email protected]> Reviewed-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Reviewed-by: Shuming Fan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Kai Vehmanen <[email protected]> Date: Mon Feb 7 17:29:58 2022 +0200 ASoC: rt5682s: do not block workqueue if card is unbound [ Upstream commit d7b530fdc45e75a54914a194c4becd9672a4e24f ] The current rt5682s_jack_detect_handler() assumes the component and card will always show up and implements an infinite usleep loop waiting for them to show up. This does not hold true if a codec interrupt (or other event) occurs when the card is unbound. The codec driver's remove or shutdown functions cannot cancel the workqueue due to the wait loop. As a result, code can either end up blocking the workqueue, or hit a kernel oops when the card is freed. Fix the issue by rescheduling the jack detect handler in case the card is not ready. In case card never shows up, the shutdown/remove/suspend calls can now cancel the detect task. Signed-off-by: Kai Vehmanen <[email protected]> Reviewed-by: Bard Liao <[email protected]> Reviewed-by: Ranjani Sridharan <[email protected]> Reviewed-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Reviewed-by: Shuming Fan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sergey Shtylyov <[email protected]> Date: Sat Feb 19 23:04:29 2022 +0300 ata: pata_hpt37x: fix PCI clock detection [ Upstream commit 5f6b0f2d037c8864f20ff15311c695f65eb09db5 ] The f_CNT register (at the PCI config. address 0x78) is 16-bit, not 8-bit! The bug was there from the very start... :-( Signed-off-by: Sergey Shtylyov <[email protected]> Fixes: 669a5db411d8 ("[libata] Add a bunch of PATA drivers.") Cc: [email protected] Signed-off-by: Damien Le Moal <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Andy Shevchenko <[email protected]> Date: Wed Feb 23 17:47:16 2022 +0200 auxdisplay: lcd2s: Fix lcd2s_redefine_char() feature commit 4424c35ead667ba2e8de7ab8206da66453e6f728 upstream. It seems that the lcd2s_redefine_char() has never been properly tested. The buffer is filled by DEF_CUSTOM_CHAR command followed by the character number (from 0 to 7), but immediately after that these bytes are rewritten by the decoded hex stream. Fix the index to fill the buffer after the command and number. Fixes: 8c9108d014c5 ("auxdisplay: add a driver for lcd2s character display") Cc: Lars Poeschel <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> [fixed typo in commit message] Signed-off-by: Miguel Ojeda <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Andy Shevchenko <[email protected]> Date: Wed Feb 23 17:47:17 2022 +0200 auxdisplay: lcd2s: Fix memory leak in ->remove() commit 898c0a15425a5bcaa8d44bd436eae5afd2483796 upstream. Once allocated the struct lcd2s_data is never freed. Fix the memory leak by switching to devm_kzalloc(). Fixes: 8c9108d014c5 ("auxdisplay: add a driver for lcd2s character display") Cc: Lars Poeschel <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Miguel Ojeda <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Andy Shevchenko <[email protected]> Date: Wed Feb 23 17:47:18 2022 +0200 auxdisplay: lcd2s: Use proper API to free the instance of charlcd object commit 9ed331f8a0fb674f4f06edf05a1687bf755af27b upstream. While it might work, the current approach is fragile in a few ways: - whenever members in the structure are shuffled, the pointer will be wrong - the resource freeing may include more than covered by kfree() Fix this by using charlcd_free() call instead of kfree(). Fixes: 8c9108d014c5 ("auxdisplay: add a driver for lcd2s character display") Cc: Lars Poeschel <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Signed-off-by: Miguel Ojeda <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sven Eckelmann <[email protected]> Date: Sun Feb 27 23:23:49 2022 +0100 batman-adv: Don't expect inter-netns unique iflink indices commit 6c1f41afc1dbe59d9d3c8bb0d80b749c119aa334 upstream. The ifindex doesn't have to be unique for multiple network namespaces on the same machine. $ ip netns add test1 $ ip -net test1 link add dummy1 type dummy $ ip netns add test2 $ ip -net test2 link add dummy2 type dummy $ ip -net test1 link show dev dummy1 6: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 96:81:55:1e:dd:85 brd ff:ff:ff:ff:ff:ff $ ip -net test2 link show dev dummy2 6: dummy2: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 5a:3c:af:35:07:c3 brd ff:ff:ff:ff:ff:ff But the batman-adv code to walk through the various layers of virtual interfaces uses this assumption because dev_get_iflink handles it internally and doesn't return the actual netns of the iflink. And dev_get_iflink only documents the situation where ifindex == iflink for physical devices. But only checking for dev->netdev_ops->ndo_get_iflink is also not an option because ipoib_get_iflink implements it even when it sometimes returns an iflink != ifindex and sometimes iflink == ifindex. The caller must therefore make sure itself to check both netns and iflink + ifindex for equality. Only when they are equal, a "physical" interface was detected which should stop the traversal. On the other hand, vxcan_get_iflink can also return 0 in case there was currently no valid peer. In this case, it is still necessary to stop. Fixes: b7eddd0b3950 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface") Fixes: 5ed4a460a1d3 ("batman-adv: additional checks for virtual interfaces on top of WiFi") Reported-by: Sabrina Dubroca <[email protected]> Signed-off-by: Sven Eckelmann <[email protected]> Signed-off-by: Simon Wunderlich <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sven Eckelmann <[email protected]> Date: Mon Feb 28 00:01:24 2022 +0100 batman-adv: Request iflink once in batadv-on-batadv check commit 690bb6fb64f5dc7437317153902573ecad67593d upstream. There is no need to call dev_get_iflink multiple times for the same net_device in batadv_is_on_batman_iface. And since some of the .ndo_get_iflink callbacks are dynamic (for example via RCUs like in vxcan_get_iflink), it could easily happen that the returned values are not stable. The pre-checks before __dev_get_by_index are then of course bogus. Fixes: b7eddd0b3950 ("batman-adv: prevent using any virtual device created on batman-adv as hard-interface") Signed-off-by: Sven Eckelmann <[email protected]> Signed-off-by: Simon Wunderlich <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sven Eckelmann <[email protected]> Date: Mon Feb 28 00:01:24 2022 +0100 batman-adv: Request iflink once in batadv_get_real_netdevice commit 6116ba09423f7d140f0460be6a1644dceaad00da upstream. There is no need to call dev_get_iflink multiple times for the same net_device in batadv_get_real_netdevice. And since some of the ndo_get_iflink callbacks are dynamic (for example via RCUs like in vxcan_get_iflink), it could easily happen that the returned values are not stable. The pre-checks before __dev_get_by_index are then of course bogus. Fixes: 5ed4a460a1d3 ("batman-adv: additional checks for virtual interfaces on top of WiFi") Signed-off-by: Sven Eckelmann <[email protected]> Signed-off-by: Simon Wunderlich <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Kees Cook <[email protected]> Date: Mon Feb 28 10:59:12 2022 -0800 binfmt_elf: Avoid total_mapping_size for ET_EXEC commit 439a8468242b313486e69b8cc3b45ddcfa898fbf upstream. Partially revert commit 5f501d555653 ("binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE"), which applied the ET_DYN "total_mapping_size" logic also to ET_EXEC. At least ia64 has ET_EXEC PT_LOAD segments that are not virtual-address contiguous (but _are_ file-offset contiguous). This would result in a giant mapping attempting to cover the entire span, including the virtual address range hole, and well beyond the size of the ELF file itself, causing the kernel to refuse to load it. For example: $ readelf -lW /usr/bin/gcc ... Program Headers: Type Offset VirtAddr PhysAddr FileSiz MemSiz ... ... LOAD 0x000000 0x4000000000000000 0x4000000000000000 0x00b5a0 0x00b5a0 ... LOAD 0x00b5a0 0x600000000000b5a0 0x600000000000b5a0 0x0005ac 0x000710 ... ... ^^^^^^^^ ^^^^^^^^^^^^^^^^^^ ^^^^^^^^ ^^^^^^^^ File offset range : 0x000000-0x00bb4c 0x00bb4c bytes Virtual address range : 0x4000000000000000-0x600000000000bcb0 0x200000000000bcb0 bytes Remove the total_mapping_size logic for ET_EXEC, which reduces the ET_EXEC MAP_FIXED_NOREPLACE coverage to only the first PT_LOAD (better than nothing), and retains it for ET_DYN. Ironically, this is the reverse of the problem that originally caused problems with MAP_FIXED_NOREPLACE: overlapping PT_LOAD segments. Future work could restore full coverage if load_elf_binary() were to perform mappings in a separate phase from the loading (where it could resolve both overlaps and holes). Cc: Eric Biederman <[email protected]> Cc: Alexander Viro <[email protected]> Cc: [email protected] Cc: [email protected] Reported-by: matoro <[email protected]> Fixes: 5f501d555653 ("binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE") Link: https://lore.kernel.org/r/[email protected] Tested-by: matoro <[email protected]> Link: https://lore.kernel.org/lkml/[email protected] Tested-By: John Paul Adrian Glaubitz <[email protected]> Link: https://lore.kernel.org/lkml/[email protected] Cc: [email protected] Signed-off-by: Kees Cook <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Yu Kuai <[email protected]> Date: Mon Feb 28 11:43:54 2022 +0800 blktrace: fix use after free for struct blk_trace commit 30939293262eb433c960c4532a0d59c4073b2b84 upstream. When tracing the whole disk, 'dropped' and 'msg' will be created under 'q->debugfs_dir' and 'bt->dir' is NULL, thus blk_trace_free() won't remove those files. What's worse, the following UAF can be triggered because of accessing stale 'dropped' and 'msg': ================================================================== BUG: KASAN: use-after-free in blk_dropped_read+0x89/0x100 Read of size 4 at addr ffff88816912f3d8 by task blktrace/1188 CPU: 27 PID: 1188 Comm: blktrace Not tainted 5.17.0-rc4-next-20220217+ #469 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-4 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_address_description.constprop.0.cold+0xab/0x381 ? blk_dropped_read+0x89/0x100 ? blk_dropped_read+0x89/0x100 kasan_report.cold+0x83/0xdf ? blk_dropped_read+0x89/0x100 kasan_check_range+0x140/0x1b0 blk_dropped_read+0x89/0x100 ? blk_create_buf_file_callback+0x20/0x20 ? kmem_cache_free+0xa1/0x500 ? do_sys_openat2+0x258/0x460 full_proxy_read+0x8f/0xc0 vfs_read+0xc6/0x260 ksys_read+0xb9/0x150 ? vfs_write+0x3d0/0x3d0 ? fpregs_assert_state_consistent+0x55/0x60 ? exit_to_user_mode_prepare+0x39/0x1e0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fbc080d92fd Code: ce 20 00 00 75 10 b8 00 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 1 RSP: 002b:00007fbb95ff9cb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 00007fbb95ff9dc0 RCX: 00007fbc080d92fd RDX: 0000000000000100 RSI: 00007fbb95ff9cc0 RDI: 0000000000000045 RBP: 0000000000000045 R08: 0000000000406299 R09: 00000000fffffffd R10: 000000000153afa0 R11: 0000000000000293 R12: 00007fbb780008c0 R13: 00007fbb78000938 R14: 0000000000608b30 R15: 00007fbb780029c8 </TASK> Allocated by task 1050: kasan_save_stack+0x1e/0x40 __kasan_kmalloc+0x81/0xa0 do_blk_trace_setup+0xcb/0x410 __blk_trace_setup+0xac/0x130 blk_trace_ioctl+0xe9/0x1c0 blkdev_ioctl+0xf1/0x390 __x64_sys_ioctl+0xa5/0xe0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 1050: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0x103/0x180 kfree+0x9a/0x4c0 __blk_trace_remove+0x53/0x70 blk_trace_ioctl+0x199/0x1c0 blkdev_common_ioctl+0x5e9/0xb30 blkdev_ioctl+0x1a5/0x390 __x64_sys_ioctl+0xa5/0xe0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88816912f380 which belongs to the cache kmalloc-96 of size 96 The buggy address is located 88 bytes inside of 96-byte region [ffff88816912f380, ffff88816912f3e0) The buggy address belongs to the page: page:000000009a1b4e7c refcount:1 mapcount:0 mapping:0000000000000000 index:0x0f flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff) raw: 0017ffffc0000200 ffffea00044f1100 dead000000000002 ffff88810004c780 raw: 0000000000000000 0000000000200020 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88816912f280: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff88816912f300: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc >ffff88816912f380: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ^ ffff88816912f400: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ffff88816912f480: fa fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc ================================================================== Fixes: c0ea57608b69 ("blktrace: remove debugfs file dentries from struct blk_trace") Signed-off-by: Yu Kuai <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Haimin Zhang <[email protected]> Date: Wed Feb 16 16:40:38 2022 +0800 block-map: add __GFP_ZERO flag for alloc_page in function bio_copy_kern [ Upstream commit cc8f7fe1f5eab010191aa4570f27641876fa1267 ] Add __GFP_ZERO flag for alloc_page in function bio_copy_kern to initialize the buffer of a bio. Signed-off-by: Haimin Zhang <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ming Lei <[email protected]> Date: Wed Jan 26 11:58:30 2022 +0800 block: loop:use kstatfs.f_bsize of backing file to set discard granularity [ Upstream commit 06582bc86d7f48d35cd044098ca1e246e8c7c52e ] If backing file's filesystem has implemented ->fallocate(), we think the loop device can support discard, then pass sb->s_blocksize as discard_granularity. However, some underlying FS, such as overlayfs, doesn't set sb->s_blocksize, and causes discard_granularity to be set as zero, then the warning in __blkdev_issue_discard() is triggered. Christoph suggested to pass kstatfs.f_bsize as discard granularity, and this way is fine because kstatfs.f_bsize means 'Optimal transfer block size', which still matches with definition of discard granularity. So fix the issue by setting discard_granularity as kstatfs.f_bsize if it is available, otherwise claims discard isn't supported. Cc: Christoph Hellwig <[email protected]> Cc: Vivek Goyal <[email protected]> Reported-by: Pei Zhang <[email protected]> Signed-off-by: Ming Lei <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Luiz Augusto von Dentz <[email protected]> Date: Mon Feb 14 17:59:38 2022 -0800 Bluetooth: Fix bt_skb_sendmmsg not allocating partial chunks [ Upstream commit 29fb608396d6a62c1b85acc421ad7a4399085b9f ] Since bt_skb_sendmmsg can be used with the likes of SOCK_STREAM it shall return the partial chunks it could allocate instead of freeing everything as otherwise it can cause problems like bellow. Fixes: 81be03e026dc ("Bluetooth: RFCOMM: Replace use of memcpy_from_msg with bt_skb_sendmmsg") Reported-by: Paul Menzel <[email protected]> Link: https://lore.kernel.org/r/[email protected] BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215594 Signed-off-by: Luiz Augusto von Dentz <[email protected]> Tested-by: Paul Menzel <[email protected]> (Nokia N9 (MeeGo/Harmattan) Signed-off-by: Marcel Holtmann <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Eric Dumazet <[email protected]> Date: Wed Mar 2 08:17:22 2022 -0800 bpf, sockmap: Do not ignore orig_len parameter commit 60ce37b03917e593d8e5d8bcc7ec820773daf81d upstream. Currently, sk_psock_verdict_recv() returns skb->len This is problematic because tcp_read_sock() might have passed orig_len < skb->len, due to the presence of TCP urgent data. This causes an infinite loop from tcp_read_sock() Followup patch will make tcp_read_sock() more robust vs bad actors. Fixes: ef5659280eb1 ("bpf, sockmap: Allow skipping sk_skb parser program") Reported-by: syzbot <[email protected]> Signed-off-by: Eric Dumazet <[email protected]> Acked-by: John Fastabend <[email protected]> Acked-by: Jakub Sitnicki <[email protected]> Tested-by: Jakub Sitnicki <[email protected]> Acked-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Filipe Manana <[email protected]> Date: Mon Feb 28 16:29:28 2022 +0000 btrfs: add missing run of delayed items after unlink during log replay commit 4751dc99627e4d1465c5bfa8cb7ab31ed418eff5 upstream. During log replay, whenever we need to check if a name (dentry) exists in a directory we do searches on the subvolume tree for inode references or or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys as well, before kernel 5.17). However when during log replay we unlink a name, through btrfs_unlink_inode(), we may not delete inode references and dir index keys from a subvolume tree and instead just add the deletions to the delayed inode's delayed items, which will only be run when we commit the transaction used for log replay. This means that after an unlink operation during log replay, if we attempt to search for the same name during log replay, we will not see that the name was already deleted, since the deletion is recorded only on the delayed items. We run delayed items after every unlink operation during log replay, except at unlink_old_inode_refs() and at add_inode_ref(). This was due to an overlook, as delayed items should be run after evert unlink, for the reasons stated above. So fix those two cases. Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation") Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination") CC: [email protected] # 4.19+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Qu Wenruo <[email protected]> Date: Fri Feb 11 14:46:12 2022 +0800 btrfs: defrag: bring back the old file extent search behavior commit d5633b0dee02d7d25e93463a03709f11c71500e2 upstream. For defrag, we don't really want to use btrfs_get_extent() to iterate all extent maps of an inode. The reasons are: - btrfs_get_extent() can merge extent maps And the result em has the higher generation of the two, causing defrag to mark unnecessary part of such merged large extent map. This in fact can result extra IO for autodefrag in v5.16+ kernels. However this patch is not going to completely solve the problem, as one can still using read() to trigger extent map reading, and got them merged. The completely solution for the extent map merging generation problem will come as an standalone fix. - btrfs_get_extent() caches the extent map result Normally it's fine, but for defrag the target range may not get another read/write for a long long time. Such cache would only increase the memory usage. - btrfs_get_extent() doesn't skip older extent map Unlike the old find_new_extent() which uses btrfs_search_forward() to skip the older subtree, thus it will pick up unnecessary extent maps. This patch will fix the regression by introducing defrag_get_extent() to replace the btrfs_get_extent() call. This helper will: - Not cache the file extent we found It will search the file extent and manually convert it to em. - Use btrfs_search_forward() to skip entire ranges which is modified in the past This should reduce the IO for autodefrag. Reported-by: Filipe Manana <[email protected]> Fixes: 7b508037d4ca ("btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file()") Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Qu Wenruo <[email protected]> Date: Fri Feb 11 14:46:13 2022 +0800 btrfs: defrag: don't use merged extent map for their generation check commit 199257a78bb01341c3ba6e85bdcf3a2e6e452c6d upstream. For extent maps, if they are not compressed extents and are adjacent by logical addresses and file offsets, they can be merged into one larger extent map. Such merged extent map will have the higher generation of all the original ones. But this brings a problem for autodefrag, as it relies on accurate extent_map::generation to determine if one extent should be defragged. For merged extent maps, their higher generation can mark some older extents to be defragged while the original extent map doesn't meet the minimal generation threshold. Thus this will cause extra IO. So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to indicate if the extent map is merged from one or more ems. And for autodefrag, if we find a merged extent map, and its generation meets the generation requirement, we just don't use this one, and go back to defrag_get_extent() to read extent maps from subvolume trees. This could cause more read IO, but should result less defrag data write, so in the long run it should be a win for autodefrag. Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Josef Bacik <[email protected]> Date: Fri Feb 18 14:56:10 2022 -0500 btrfs: do not start relocation until in progress drops are done commit b4be6aefa73c9a6899ef3ba9c5faaa8a66e333ef upstream. We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. CC: [email protected] # 5.10+ Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Josef Bacik <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Josef Bacik <[email protected]> Date: Fri Feb 18 10:17:39 2022 -0500 btrfs: do not WARN_ON() if we have PageError set commit a50e1fcbc9b85fd4e95b89a75c0884cb032a3e06 upstream. Whenever we do any extent buffer operations we call assert_eb_page_uptodate() to complain loudly if we're operating on an non-uptodate page. Our overnight tests caught this warning earlier this week WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1f6/0x470 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x26d/0x580 ? process_one_work+0x580/0x580 worker_thread+0x55/0x3b0 ? process_one_work+0x580/0x580 kthread+0xf0/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer uptodate when we fail to write it"), however all that fix did was keep us from finding extent buffers after a failed writeout. It didn't keep us from continuing to use a buffer that we already had found. In this case we're searching the commit root to cache the block group, so we can start committing the transaction and switch the commit root and then start writing. After the switch we can look up an extent buffer that hasn't been written yet and start processing that block group. Then we fail to write that block out and clear Uptodate on the page, and then we start spewing these errors. Normally we're protected by the tree lock to a certain degree here. If we read a block we have that block read locked, and we block the writer from locking the block before we submit it for the write. However this isn't necessarily fool proof because the read could happen before we do the submit_bio and after we locked and unlocked the extent buffer. Also in this particular case we have path->skip_locking set, so that won't save us here. We'll simply get a block that was valid when we read it, but became invalid while we were using it. What we really want is to catch the case where we've "read" a block but it's not marked Uptodate. On read we ClearPageError(), so if we're !Uptodate and !Error we know we didn't do the right thing for reading the page. Fix this by checking !Uptodate && !Error, this way we will not complain if our buffer gets invalidated while we're using it, and we'll maintain the spirit of the check which is to make sure we have a fully in-cache block while we're messing with it. CC: [email protected] # 5.4+ Signed-off-by: Josef Bacik <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Filipe Manana <[email protected]> Date: Wed Mar 2 11:48:39 2022 +0000 btrfs: fallback to blocking mode when doing async dio over multiple extents commit ca93e44bfb5fd7996b76f0f544999171f647f93b upstream. Some users recently reported that MariaDB was getting a read corruption when using io_uring on top of btrfs. This started to happen in 5.16, after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults during direct IO reads and writes"). That changed btrfs to use the new iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling iomap_dio_rw(). This was necessary to fix deadlocks when the iovector corresponds to a memory mapped file region. That type of scenario is exercised by test case generic/647 from fstests. For this MariaDB scenario, we attempt to read 16K from file offset X using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each with a size of 4K, and what happens is the following: 1) btrfs_direct_read() disables page faults and calls iomap_dio_rw(); 2) iomap creates a struct iomap_dio object, its reference count is initialized to 1 and its ->size field is initialized to 0; 3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds the first 4K extent, and setups an iomap for this extent consisting of a single page; 4) At iomap_dio_bio_iter(), we are able to access the first page of the buffer (struct iov_iter) with bio_iov_iter_get_pages() without triggering a page fault; 5) iomap submits a bio for this 4K extent (iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments the refcount on the struct iomap_dio object to 2; The ->size field of the struct iomap_dio object is incremented to 4K; 6) iomap calls btrfs_iomap_begin() again, this time with a file offset of X + 4K. There we setup an iomap for the next extent that also has a size of 4K; 7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(), which tries to access the next page (2nd page) of the buffer. This triggers a page fault and returns -EFAULT; 8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and the struct iomap_dio object has a ->size value of 4K (we submitted a bio for an extent already). The 'wait_for_completion' variable is not set to true, because our iocb has IOCB_NOWAIT set; 9) At the bottom of __iomap_dio_rw(), we decrement the reference count of the struct iomap_dio object from 2 to 1. Because we were not the only ones holding a reference on it and 'wait_for_completion' is set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which just returns it up the callchain, up to io_uring; 10) The bio submitted for the first extent (step 5) completes and its bio endio function, iomap_dio_bio_end_io(), decrements the last reference on the struct iomap_dio object, resulting in calling iomap_dio_complete_work() -> iomap_dio_complete(). 11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K and return 4K (the amount of io done) to iomap_dio_complete_work(); 12) iomap_dio_complete_work() calls the iocb completion callback, iocb->ki_complete() with a second argument value of 4K (total io done) and the iocb with the adjust ki_pos of X + 4K. This results in completing the read request for io_uring, leaving it with a result of 4K bytes read, and only the first page of the buffer filled in, while the remaining 3 pages, corresponding to the other 3 extents, were not filled; 13) For the application, the result is unexpected because if we ask to read N bytes, it expects to get N bytes read as long as those N bytes don't cross the EOF (i_size). MariaDB reports this as an error, as it's not expecting a short read, since it knows it's asking for read operations fully within the i_size boundary. This is typical in many applications, but it may also be questionable if they should react to such short reads by issuing more read calls to get the remaining data. Nevertheless, the short read happened due to a change in btrfs regarding how it deals with page faults while in the middle of a read operation, and there's no reason why btrfs can't have the previous behaviour of returning the whole data that was requested by the application. The problem can also be triggered with the following simple program: /* Get O_DIRECT */ #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <errno.h> #include <string.h> #include <liburing.h> int main(int argc, char *argv[]) { char *foo_path; struct io_uring ring; struct io_uring_sqe *sqe; struct io_uring_cqe *cqe; struct iovec iovec; int fd; long pagesize; void *write_buf; void *read_buf; ssize_t ret; int i; if (argc != 2) { fprintf(stderr, "Use: %s <directory>\n", argv[0]); return 1; } foo_path = malloc(strlen(argv[1]) + 5); if (!foo_path) { fprintf(stderr, "Failed to allocate memory for file path\n"); return 1; } strcpy(foo_path, argv[1]); strcat(foo_path, "/foo"); /* * Create file foo with 2 extents, each with a size matching * the page size. Then allocate a buffer to read both extents * with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing * the read with io_uring, access the first page of the buffer * to fault it in, so that during the read we only trigger a * page fault when accessing the second page of the buffer. */ fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY | O_DIRECT, 0666); if (fd == -1) { fprintf(stderr, "Failed to create file 'foo': %s (errno %d)", strerror(errno), errno); return 1; } pagesize = sysconf(_SC_PAGE_SIZE); ret = posix_memalign(&write_buf, pagesize, 2 * pagesize); if (ret) { fprintf(stderr, "Failed to allocate write buffer\n"); return 1; } memset(write_buf, 0xab, pagesize); memset(write_buf + pagesize, 0xcd, pagesize); /* Create 2 extents, each with a size matching page size. */ for (i = 0; i < 2; i++) { ret = pwrite(fd, write_buf + i * pagesize, pagesize, i * pagesize); if (ret != pagesize) { fprintf(stderr, "Failed to write to file, ret = %ld errno %d (%s)\n", ret, errno, strerror(errno)); return 1; } ret = fsync(fd); if (ret != 0) { fprintf(stderr, "Failed to fsync file\n"); return 1; } } close(fd); fd = open(foo_path, O_RDONLY | O_DIRECT); if (fd == -1) { fprintf(stderr, "Failed to open file 'foo': %s (errno %d)", strerror(errno), errno); return 1; } ret = posix_memalign(&read_buf, pagesize, 2 * pagesize); if (ret) { fprintf(stderr, "Failed to allocate read buffer\n"); return 1; } /* * Fault in only the first page of the read buffer. * We want to trigger a page fault for the 2nd page of the * read buffer during the read operation with io_uring * (O_DIRECT and IOCB_NOWAIT). */ memset(read_buf, 0, 1); ret = io_uring_queue_init(1, &ring, 0); if (ret != 0) { fprintf(stderr, "Failed to create io_uring queue\n"); return 1; } sqe = io_uring_get_sqe(&ring); if (!sqe) { fprintf(stderr, "Failed to get io_uring sqe\n"); return 1; } iovec.iov_base = read_buf; iovec.iov_len = 2 * pagesize; io_uring_prep_readv(sqe, fd, &iovec, 1, 0); ret = io_uring_submit_and_wait(&ring, 1); if (ret != 1) { fprintf(stderr, "Failed at io_uring_submit_and_wait()\n"); return 1; } ret = io_uring_wait_cqe(&ring, &cqe); if (ret < 0) { fprintf(stderr, "Failed at io_uring_wait_cqe()\n"); return 1; } printf("io_uring read result for file foo:\n\n"); printf(" cqe->res == %d (expected %d)\n", cqe->res, 2 * pagesize); printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n", memcmp(read_buf, write_buf, 2 * pagesize)); io_uring_cqe_seen(&ring, cqe); io_uring_queue_exit(&ring); return 0; } When running it on an unpatched kernel: $ gcc io_uring_test.c -luring $ mkfs.btrfs -f /dev/sda $ mount /dev/sda /mnt/sda $ ./a.out /mnt/sda io_uring read result for file foo: cqe->res == 4096 (expected 8192) memcmp(read_buf, write_buf) == -205 (expected 0) After this patch, the read always returns 8192 bytes, with the buffer filled with the correct data. Although that reproducer always triggers the bug in my test vms, it's possible that it will not be so reliable on other environments, as that can happen if the bio for the first extent completes and decrements the reference on the struct iomap_dio object before we do the atomic_dec_and_test() on the reference at __iomap_dio_rw(). Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag set) over a range that spans multiple extents (or a mix of extents and holes). This avoids returning success to the caller when we only did partial IO, which is not optimal for writes and for reads it's actually incorrect, as the caller doesn't expect to get less bytes read than it has requested (unless EOF is crossed), as previously mentioned. This is also the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()), even though it doesn't use IOMAP_DIO_PARTIAL. A test case for fstests will follow soon. Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/ CC: [email protected] # 5.16+ Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Filipe Manana <[email protected]> Date: Thu Oct 28 16:03:41 2021 +0100 btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range commit f0bfa76a11e93d0fe2c896fcb566568c5e8b5d3f upstream. When doing a direct IO write against a file range that either has preallocated extents in that range or has regular extents and the file has the NOCOW attribute set, the write fails with -ENOSPC when all of the following conditions are met: 1) There are no data blocks groups with enough free space matching the size of the write; 2) There's not enough unallocated space for allocating a new data block group; 3) The extents in the target file range are not shared, neither through snapshots nor through reflinks. This is wrong because a NOCOW write can be done in such case, and in fact it's possible to do it using a buffered IO write, since when failing to allocate data space, the buffered IO path checks if a NOCOW write is possible. The failure in direct IO write path comes from the fact that early on, at btrfs_dio_iomap_begin(), we try to allocate data space for the write and if it that fails we return the error and stop - we never check if we can do NOCOW. But later, at btrfs_get_blocks_direct_write(), we check if we can do a NOCOW write into the range, or a subset of the range, and then release the previously reserved data space. Fix this by doing the data reservation only if needed, when we must COW, at btrfs_get_blocks_direct_write() instead of doing it at btrfs_dio_iomap_begin(). This also simplifies a bit the logic and removes the inneficiency of doing unnecessary data reservations. The following example test script reproduces the problem: $ cat dio-nocow-enospc.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj # Use a small fixed size (1G) filesystem so that it's quick to fill # it up. # Make sure the mixed block groups feature is not enabled because we # later want to not have more space available for allocating data # extents but still have enough metadata space free for the file writes. mkfs.btrfs -f -b $((1024 * 1024 * 1024)) -O ^mixed-bg $DEV mount $DEV $MNT # Create our test file with the NOCOW attribute set. touch $MNT/foobar chattr +C $MNT/foobar # Now fill in all unallocated space with data for our test file. # This will allocate a data block group that will be full and leave # no (or a very small amount of) unallocated space in the device, so # that it will not be possible to allocate a new block group later. echo echo "Creating test file with initial data..." xfs_io -c "pwrite -S 0xab -b 1M 0 900M" $MNT/foobar # Now try a direct IO write against file range [0, 10M[. # This should succeed since this is a NOCOW file and an extent for the # range was previously allocated. echo echo "Trying direct IO write over allocated space..." xfs_io -d -c "pwrite -S 0xcd -b 10M 0 10M" $MNT/foobar umount $MNT When running the test: $ ./dio-nocow-enospc.sh (...) Creating test file with initial data... wrote 943718400/943718400 bytes at offset 0 900 MiB, 900 ops; 0:00:01.43 (625.526 MiB/sec and 625.5265 ops/sec) Trying direct IO write over allocated space... pwrite: No space left on device A test case for fstests will follow, testing both this direct IO write scenario as well as the buffered IO write scenario to make it less likely to get future regressions on the buffered IO case. Reviewed-by: Josef Bacik <[email protected]> Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Filipe Manana <[email protected]> Date: Thu Feb 17 12:12:02 2022 +0000 btrfs: fix lost prealloc extents beyond eof after full fsync commit d99478874355d3a7b9d86dfb5d7590d5b1754b1f upstream. When doing a full fsync, if we have prealloc extents beyond (or at) eof, and the leaves that contain them were not modified in the current transaction, we end up not logging them. This results in losing those extents when we replay the log after a power failure, since the inode is truncated to the current value of the logged i_size. Just like for the fast fsync path, we need to always log all prealloc extents starting at or beyond i_size. The fast fsync case was fixed in commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size after fsync log replay") but it missed the full fsync path. The problem exists since the very early days, when the log tree was added by commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize synchronous operations"). Example reproducer: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt # Create our test file with many file extent items, so that they span # several leaves of metadata, even if the node/page size is 64K. Use # direct IO and not fsync/O_SYNC because it's both faster and it avoids # clearing the full sync flag from the inode - we want the fsync below # to trigger the slow full sync code path. $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo # Now add two preallocated extents to our file without extending the # file's size. One right at i_size, and another further beyond, leaving # a gap between the two prealloc extents. $ xfs_io -c "falloc -k 16M 1M" /mnt/foo $ xfs_io -c "falloc -k 20M 1M" /mnt/foo # Make sure everything is durably persisted and the transaction is # committed. This makes all created extents to have a generation lower # than the generation of the transaction used by the next write and # fsync. sync # Now overwrite only the first extent, which will result in modifying # only the first leaf of metadata for our inode. Then fsync it. This # fsync will use the slow code path (inode full sync bit is set) because # it's the first fsync since the inode was created/loaded. $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo # Extent list before power failure. $ xfs_io -c "fiemap -v" /mnt/foo /mnt/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 2178048..2178055 8 0x0 1: [8..16383]: 26632..43007 16376 0x0 2: [16384..32767]: 2156544..2172927 16384 0x0 3: [32768..34815]: 2172928..2174975 2048 0x800 4: [34816..40959]: hole 6144 5: [40960..43007]: 2174976..2177023 2048 0x801 <power fail> # Mount fs again, trigger log replay. $ mount /dev/sdc /mnt # Extent list after power failure and log replay. $ xfs_io -c "fiemap -v" /mnt/foo /mnt/foo: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..7]: 2178048..2178055 8 0x0 1: [8..16383]: 26632..43007 16376 0x0 2: [16384..32767]: 2156544..2172927 16384 0x1 # The prealloc extents at file offsets 16M and 20M are missing. So fix this by calling btrfs_log_prealloc_extents() when we are doing a full fsync, so that we always log all prealloc extents beyond eof. A test case for fstests will follow soon. CC: [email protected] # 4.19+ Signed-off-by: Filipe Manana <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Omar Sandoval <[email protected]> Date: Thu Feb 17 15:14:43 2022 -0800 btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() commit 5fd76bf31ccfecc06e2e6b29f8c809e934085b99 upstream. We are seeing crashes similar to the following trace: [38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54 [38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs] [38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206 [38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14 [38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360 [38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000 [38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800 [38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360 [38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000 [38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0 [38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [38.992528] Call Trace: [38.992854] <TASK> [38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs] [38.993941] btrfs_balance+0x78e/0xea0 [btrfs] [38.994801] ? vsnprintf+0x33c/0x520 [38.995368] ? __kmalloc_track_caller+0x351/0x440 [38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs] [38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs] [38.997867] ? mod_objcg_state+0xee/0x340 [38.998552] ? seq_release+0x24/0x30 [38.999184] ? proc_nr_files+0x30/0x30 [38.999654] ? call_rcu+0xc8/0x2f0 [39.000228] ? __x64_sys_ioctl+0x84/0xc0 [39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [39.001973] __x64_sys_ioctl+0x84/0xc0 [39.002566] do_syscall_64+0x3a/0x80 [39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae [39.003735] RIP: 0033:0x7f11c166959b [39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b [39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003 [39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0 [39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3 [39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001 [39.015040] </TASK> [39.015418] ---[ end trace 0000000000000000 ]--- [43.131559] ------------[ cut here ]------------ [43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717! [43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI [43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54 [43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000 [43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0 [43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [43.150559] Call Trace: [43.150904] <TASK> [43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs] [43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs] [43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] [43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs] [43.154475] ? __check_object_size+0x150/0x170 [43.155170] ? preempt_count_add+0x49/0xa0 [43.155753] ? __x64_sys_ioctl+0x84/0xc0 [43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [43.157456] __x64_sys_ioctl+0x84/0xc0 [43.157980] do_syscall_64+0x3a/0x80 [43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae [43.159231] RIP: 0033:0x7f7657f1e59b [43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b [43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003 [43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 [43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800 [43.166863] </TASK> [43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common [43.169552] ---[ end trace 0000000000000000 ]--- [43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] [43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 [43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 [43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff [43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 [43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 [43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 [43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000 [43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0 [43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 We first hit the WARN_ON(rc->block_group->pinned > 0) in btrfs_relocate_block_group() and then the BUG_ON(!cache) in unpin_extent_range(). This tells us that we are exiting relocation and removing the block group with bytes still pinned for that block group. This is supposed to be impossible: the last thing relocate_block_group() does is commit the transaction to get rid of pinned extents. Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") introduced an optimization so that commits from fsync don't have to wait for the previous commit to unpin extents. This was only intended to affect fsync, but it inadvertently made it possible for any commit to skip waiting for the previous commit to unpin. This is because if a call to btrfs_commit_transaction() finds that another thread is already committing the transaction, it waits for the other thread to complete the commit and then returns. If that other thread was in fsync, then it completes the commit without completing the previous commit. This makes the following sequence of events possible: Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________ btrfs_commit_transaction(N) | | btrfs_run_delayed_refs | | pin extents | | ... | | state = UNBLOCKED |btrfs_sync_file | | btrfs_start_transaction(N + 1) |relocate_block_group | | btrfs_join_transaction(N + 1) | btrfs_commit_transaction(N + 1) | ... | trans->state = COMMIT_START | | | btrfs_commit_transaction(N + 1) | | wait_for_commit(N + 1, COMPLETED) | wait_for_commit(N, SUPER_COMMITTED)| state = SUPER_COMMITTED | ... | btrfs_finish_extent_commit| | unpin_extent_range() | trans->state = COMPLETED | | | return | | ... | |Thread 1 isn't done, so pinned > 0 | |and we WARN | | | |btrfs_remove_block_group unpin_extent_range() | | Thread 3 removed the | | block group, so we BUG| | There are other sequences involving SUPER_COMMITTED transactions that can cause a similar outcome. We could fix this by making relocation explicitly wait for unpinning, but there may be other cases that need it. Josef mentioned ENOSPC flushing and the free space cache inode as other potential victims. Rather than playing whack-a-mole, this fix is conservative and makes all commits not in fsync wait for all previous transactions, which is what the optimization intended. Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") CC: [email protected] # 5.15+ Reviewed-by: Filipe Manana <[email protected]> Signed-off-by: Omar Sandoval <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Filipe Manana <[email protected]> Date: Wed Feb 2 15:26:09 2022 +0000 btrfs: get rid of warning on transaction commit when using flushoncommit [ Upstream commit a0f0cf8341e34e5d2265bfd3a7ad68342da1e2aa ] When using the flushoncommit mount option, during almost every transaction commit we trigger a warning from __writeback_inodes_sb_nr(): $ cat fs/fs-writeback.c: (...) static void __writeback_inodes_sb_nr(struct super_block *sb, ... { (...) WARN_ON(!rwsem_is_locked(&sb->s_umount)); (...) } (...) The trace produced in dmesg looks like the following: [947.473890] WARNING: CPU: 5 PID: 930 at fs/fs-writeback.c:2610 __writeback_inodes_sb_nr+0x7e/0xb3 [947.481623] Modules linked in: nfsd nls_cp437 cifs asn1_decoder cifs_arc4 fscache cifs_md4 ipmi_ssif [947.489571] CPU: 5 PID: 930 Comm: btrfs-transacti Not tainted 95.16.3-srb-asrock-00001-g36437ad63879 #186 [947.497969] RIP: 0010:__writeback_inodes_sb_nr+0x7e/0xb3 [947.502097] Code: 24 10 4c 89 44 24 18 c6 (...) [947.519760] RSP: 0018:ffffc90000777e10 EFLAGS: 00010246 [947.523818] RAX: 0000000000000000 RBX: 0000000000963300 RCX: 0000000000000000 [947.529765] RDX: 0000000000000000 RSI: 000000000000fa51 RDI: ffffc90000777e50 [947.535740] RBP: ffff888101628a90 R08: ffff888100955800 R09: ffff888100956000 [947.541701] R10: 0000000000000002 R11: 0000000000000001 R12: ffff888100963488 [947.547645] R13: ffff888100963000 R14: ffff888112fb7200 R15: ffff888100963460 [947.553621] FS: 0000000000000000(0000) GS:ffff88841fd40000(0000) knlGS:0000000000000000 [947.560537] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [947.565122] CR2: 0000000008be50c4 CR3: 000000000220c000 CR4: 00000000001006e0 [947.571072] Call Trace: [947.572354] <TASK> [947.573266] btrfs_commit_transaction+0x1f1/0x998 [947.576785] ? start_transaction+0x3ab/0x44e [947.579867] ? schedule_timeout+0x8a/0xdd [947.582716] transaction_kthread+0xe9/0x156 [947.585721] ? btrfs_cleanup_transaction.isra.0+0x407/0x407 [947.590104] kthread+0x131/0x139 [947.592168] ? set_kthread_struct+0x32/0x32 [947.595174] ret_from_fork+0x22/0x30 [947.597561] </TASK> [947.598553] ---[ end trace 644721052755541c ]--- This is because we started using writeback_inodes_sb() to flush delalloc when committing a transaction (when using -o flushoncommit), in order to avoid deadlocks with filesystem freeze operations. This change was made by commit ce8ea7cc6eb313 ("btrfs: don't call btrfs_start_delalloc_roots in flushoncommit"). After that change we started producing that warning, and every now and then a user reports this since the warning happens too often, it spams dmesg/syslog, and a user is unsure if this reflects any problem that might compromise the filesystem's reliability. We can not just lock the sb->s_umount semaphore before calling writeback_inodes_sb(), because that would at least deadlock with filesystem freezing, since at fs/super.c:freeze_super() sync_filesystem() is called while we are holding that semaphore in write mode, and that can trigger a transaction commit, resulting in a deadlock. It would also trigger the same type of deadlock in the unmount path. Possibly, it could also introduce some other locking dependencies that lockdep would report. To fix this call try_to_writeback_inodes_sb() instead of writeback_inodes_sb(), because that will try to read lock sb->s_umount and then will only call writeback_inodes_sb() if it was able to lock it. This is fine because the cases where it can't read lock sb->s_umount are during a filesystem unmount or during a filesystem freeze - in those cases sb->s_umount is write locked and sync_filesystem() is called, which calls writeback_inodes_sb(). In other words, in all cases where we can't take a read lock on sb->s_umount, writeback is already being triggered elsewhere. An alternative would be to call btrfs_start_delalloc_roots() with a number of pages different from LONG_MAX, for example matching the number of delalloc bytes we currently have, in which case we would end up starting all delalloc with filemap_fdatawrite_wbc() and not with an async flush via filemap_flush() - that is only possible after the rather recent commit e076ab2a2ca70a ("btrfs: shrink delalloc pages instead of full inodes"). However that creates a whole new can of worms due to new lock dependencies, which lockdep complains, like for example: [ 8948.247280] ====================================================== [ 8948.247823] WARNING: possible circular locking dependency detected [ 8948.248353] 5.17.0-rc1-btrfs-next-111 #1 Not tainted [ 8948.248786] ------------------------------------------------------ [ 8948.249320] kworker/u16:18/933570 is trying to acquire lock: [ 8948.249812] ffff9b3de1591690 (sb_internal#2){.+.+}-{0:0}, at: find_free_extent+0x141e/0x1590 [btrfs] [ 8948.250638] but task is already holding lock: [ 8948.251140] ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs] [ 8948.252018] which lock already depends on the new lock. [ 8948.252710] the existing dependency chain (in reverse order) is: [ 8948.253343] -> #2 (&root->delalloc_mutex){+.+.}-{3:3}: [ 8948.253950] __mutex_lock+0x90/0x900 [ 8948.254354] start_delalloc_inodes+0x78/0x400 [btrfs] [ 8948.254859] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs] [ 8948.255408] btrfs_commit_transaction+0x32f/0xc00 [btrfs] [ 8948.255942] btrfs_mksubvol+0x380/0x570 [btrfs] [ 8948.256406] btrfs_mksnapshot+0x81/0xb0 [btrfs] [ 8948.256870] __btrfs_ioctl_snap_create+0x17f/0x190 [btrfs] [ 8948.257413] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [ 8948.257961] btrfs_ioctl+0x1196/0x3630 [btrfs] [ 8948.258418] __x64_sys_ioctl+0x83/0xb0 [ 8948.258793] do_syscall_64+0x3b/0xc0 [ 8948.259146] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 8948.259709] -> #1 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}: [ 8948.260330] __mutex_lock+0x90/0x900 [ 8948.260692] btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs] [ 8948.261234] btrfs_commit_transaction+0x32f/0xc00 [btrfs] [ 8948.261766] btrfs_set_free_space_cache_v1_active+0x38/0x60 [btrfs] [ 8948.262379] btrfs_start_pre_rw_mount+0x119/0x180 [btrfs] [ 8948.262909] open_ctree+0x1511/0x171e [btrfs] [ 8948.263359] btrfs_mount_root.cold+0x12/0xde [btrfs] [ 8948.263863] legacy_get_tree+0x30/0x50 [ 8948.264242] vfs_get_tree+0x28/0xc0 [ 8948.264594] vfs_kern_mount.part.0+0x71/0xb0 [ 8948.265017] btrfs_mount+0x11d/0x3a0 [btrfs] [ 8948.265462] legacy_get_tree+0x30/0x50 [ 8948.265851] vfs_get_tree+0x28/0xc0 [ 8948.266203] path_mount+0x2d4/0xbe0 [ 8948.266554] __x64_sys_mount+0x103/0x140 [ 8948.266940] do_syscall_64+0x3b/0xc0 [ 8948.267300] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 8948.267790] -> #0 (sb_internal#2){.+.+}-{0:0}: [ 8948.268322] __lock_acquire+0x12e8/0x2260 [ 8948.268733] lock_acquire+0xd7/0x310 [ 8948.269092] start_transaction+0x44c/0x6e0 [btrfs] [ 8948.269591] find_free_extent+0x141e/0x1590 [btrfs] [ 8948.270087] btrfs_reserve_extent+0x14b/0x280 [btrfs] [ 8948.270588] cow_file_range+0x17e/0x490 [btrfs] [ 8948.271051] btrfs_run_delalloc_range+0x345/0x7a0 [btrfs] [ 8948.271586] writepage_delalloc+0xb5/0x170 [btrfs] [ 8948.272071] __extent_writepage+0x156/0x3c0 [btrfs] [ 8948.272579] extent_write_cache_pages+0x263/0x460 [btrfs] [ 8948.273113] extent_writepages+0x76/0x130 [btrfs] [ 8948.273573] do_writepages+0xd2/0x1c0 [ 8948.273942] filemap_fdatawrite_wbc+0x68/0x90 [ 8948.274371] start_delalloc_inodes+0x17f/0x400 [btrfs] [ 8948.274876] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs] [ 8948.275417] flush_space+0x1f2/0x630 [btrfs] [ 8948.275863] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs] [ 8948.276438] process_one_work+0x252/0x5a0 [ 8948.276829] worker_thread+0x55/0x3b0 [ 8948.277189] kthread+0xf2/0x120 [ 8948.277506] ret_from_fork+0x22/0x30 [ 8948.277868] other info that might help us debug this: [ 8948.278548] Chain exists of: sb_internal#2 --> &fs_info->delalloc_root_mutex --> &root->delalloc_mutex [ 8948.279601] Possible unsafe locking scenario: [ 8948.280102] CPU0 CPU1 [ 8948.280508] ---- ---- [ 8948.280915] lock(&root->delalloc_mutex); [ 8948.281271] lock(&fs_info->delalloc_root_mutex); [ 8948.281915] lock(&root->delalloc_mutex); [ 8948.282487] lock(sb_internal#2); [ 8948.282800] *** DEADLOCK *** [ 8948.283333] 4 locks held by kworker/u16:18/933570: [ 8948.283750] #0: ffff9b3dc00a9d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0 [ 8948.284609] #1: ffffa90349dafe70 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1d2/0x5a0 [ 8948.285637] #2: ffff9b3e14db5040 (&fs_info->delalloc_root_mutex){+.+.}-{3:3}, at: btrfs_start_delalloc_roots+0x97/0x2a0 [btrfs] [ 8948.286674] #3: ffff9b3e09c717d8 (&root->delalloc_mutex){+.+.}-{3:3}, at: start_delalloc_inodes+0x78/0x400 [btrfs] [ 8948.287596] stack backtrace: [ 8948.287975] CPU: 3 PID: 933570 Comm: kworker/u16:18 Not tainted 5.17.0-rc1-btrfs-next-111 #1 [ 8948.288677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 [ 8948.289649] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs] [ 8948.290298] Call Trace: [ 8948.290517] <TASK> [ 8948.290700] dump_stack_lvl+0x59/0x73 [ 8948.291026] check_noncircular+0xf3/0x110 [ 8948.291375] ? start_transaction+0x228/0x6e0 [btrfs] [ 8948.291826] __lock_acquire+0x12e8/0x2260 [ 8948.292241] lock_acquire+0xd7/0x310 [ 8948.292714] ? find_free_extent+0x141e/0x1590 [btrfs] [ 8948.293241] ? lock_is_held_type+0xea/0x140 [ 8948.293601] start_transaction+0x44c/0x6e0 [btrfs] [ 8948.294055] ? find_free_extent+0x141e/0x1590 [btrfs] [ 8948.294518] find_free_extent+0x141e/0x1590 [btrfs] [ 8948.294957] ? _raw_spin_unlock+0x29/0x40 [ 8948.295312] ? btrfs_get_alloc_profile+0x124/0x290 [btrfs] [ 8948.295813] btrfs_reserve_extent+0x14b/0x280 [btrfs] [ 8948.296270] cow_file_range+0x17e/0x490 [btrfs] [ 8948.296691] btrfs_run_delalloc_range+0x345/0x7a0 [btrfs] [ 8948.297175] ? find_lock_delalloc_range+0x247/0x270 [btrfs] [ 8948.297678] writepage_delalloc+0xb5/0x170 [btrfs] [ 8948.298123] __extent_writepage+0x156/0x3c0 [btrfs] [ 8948.298570] extent_write_cache_pages+0x263/0x460 [btrfs] [ 8948.299061] extent_writepages+0x76/0x130 [btrfs] [ 8948.299495] do_writepages+0xd2/0x1c0 [ 8948.299817] ? sched_clock_cpu+0xd/0x110 [ 8948.300160] ? lock_release+0x155/0x4a0 [ 8948.300494] filemap_fdatawrite_wbc+0x68/0x90 [ 8948.300874] ? do_raw_spin_unlock+0x4b/0xa0 [ 8948.301243] start_delalloc_inodes+0x17f/0x400 [btrfs] [ 8948.301706] ? lock_release+0x155/0x4a0 [ 8948.302055] btrfs_start_delalloc_roots+0x194/0x2a0 [btrfs] [ 8948.302564] flush_space+0x1f2/0x630 [btrfs] [ 8948.302970] btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs] [ 8948.303510] process_one_work+0x252/0x5a0 [ 8948.303860] ? process_one_work+0x5a0/0x5a0 [ 8948.304221] worker_thread+0x55/0x3b0 [ 8948.304543] ? process_one_work+0x5a0/0x5a0 [ 8948.304904] kthread+0xf2/0x120 [ 8948.305184] ? kthread_complete_and_exit+0x20/0x20 [ 8948.305598] ret_from_fork+0x22/0x30 [ 8948.305921] </TASK> It all comes from the fact that btrfs_start_delalloc_roots() takes the delalloc_root_mutex, in the transaction commit path we are holding a read lock on one of the superblock's freeze semaphores (via sb_start_intwrite()), the async reclaim task can also do a call to btrfs_start_delalloc_roots(), which ends up triggering writeback with calls to filemap_fdatawrite_wbc(), resulting in extent allocation which in turn can call btrfs_start_transaction(), which will result in taking the freeze semaphore via sb_start_intwrite(), forming a nasty dependency on all those locks which can be taken in different orders by different code paths. So just adopt the simple approach of calling try_to_writeback_inodes_sb() at btrfs_start_delalloc_flush(). Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Link: https://lore.kernel.org/linux-btrfs/[email protected]/ Reviewed-by: Omar Sandoval <[email protected]> Signed-off-by: Filipe Manana <[email protected]> [ add more link reports ] Signed-off-by: David Sterba <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sidong Yang <[email protected]> Date: Mon Feb 28 01:43:40 2022 +0000 btrfs: qgroup: fix deadlock between rescan worker and remove qgroup commit d4aef1e122d8bbdc15ce3bd0bc813d6b44a7d63a upstream. The commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") by Kawasaki resolves deadlock between quota disable and qgroup rescan worker. But also there is a deadlock case like it. It's about enabling or disabling quota and creating or removing qgroup. It can be reproduced in simple script below. for i in {1..100} do btrfs quota enable /mnt & btrfs qgroup create 1/0 /mnt & btrfs qgroup destroy 1/0 /mnt & btrfs quota disable /mnt & done Here's why the deadlock happens: 1) The quota rescan task is running. 2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for the quota rescan task to complete. 3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock the qgroup_ioctl_lock mutex, because it's being held by task A. At that point task B is holding a transaction handle for the current transaction. 4) The quota rescan task calls btrfs_commit_transaction(). This results in it waiting for all other tasks to release their handles on the transaction, but task B is blocked on the qgroup_ioctl_lock mutex while holding a handle on the transaction, and that mutex is being held by task A, which is waiting for the quota rescan task to complete, resulting in a deadlock between these 3 tasks. To resolve this issue, the thread disabling quota should unlock qgroup_ioctl_lock before waiting rescan completion. Move btrfs_qgroup_wait_for_completion() after unlock of qgroup_ioctl_lock. Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker") CC: [email protected] # 5.4+ Reviewed-by: Filipe Manana <[email protected]> Reviewed-by: Shin'ichiro Kawasaki <[email protected]> Signed-off-by: Sidong Yang <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Qu Wenruo <[email protected]> Date: Fri Feb 18 10:13:00 2022 +0800 btrfs: subpage: fix a wrong check on subpage->writers commit c992fa1fd52380d0c4ced7b07479e877311ae645 upstream. [BUG] When looping btrfs/074 with 64K page size and 4K sectorsize, there is a low chance (1/50~1/100) to crash with the following ASSERT() triggered in btrfs_subpage_start_writer(): ret = atomic_add_return(nbits, &subpage->writers); ASSERT(ret == nbits); <<< This one <<< [CAUSE] With more debugging output on the parameters of btrfs_subpage_start_writer(), it shows a very concerning error: ret=29 nbits=13 start=393216 len=53248 For @nbits it's correct, but @ret which is the returned value from atomic_add_return(), it's not only larger than nbits, but also larger than max sectors per page value (for 64K page size and 4K sector size, it's 16). This indicates that some call sites are not properly decreasing the value. And that's exactly the case, in btrfs_page_unlock_writer(), due to the fact that we can have page locked either by lock_page() or process_one_page(), we have to check if the subpage has any writer. If no writers, it's locked by lock_page() and we only need to unlock it. But unfortunately the check for the writers are completely opposite: if (atomic_read(&subpage->writers)) /* No writers, locked by plain lock_page() */ return unlock_page(page); We directly unlock the page if it has writers, which is the completely opposite what we want. Thankfully the affected call site is only limited to extent_write_locked_range(), so it's mostly affecting compressed write. [FIX] Just fix the wrong check condition to fix the bug. Fixes: e55a0de18572 ("btrfs: rework page locking in __extent_writepage()") CC: [email protected] # 5.16 Signed-off-by: Qu Wenruo <[email protected]> Signed-off-by: David Sterba <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Vincent Mailhol <[email protected]> Date: Sat Feb 12 20:27:13 2022 +0900 can: etas_es58x: change opened_channel_cnt's type from atomic_t to u8 [ Upstream commit f4896248e9025ff744b4147e6758274a1cb8cbae ] The driver uses an atomic_t variable: struct es58x_device::opened_channel_cnt to keep track of the number of opened channels in order to only allocate memory for the URBs when this count changes from zero to one. While the intent was to prevent race conditions, the choice of an atomic_t turns out to be a bad idea for several reasons: - implementation is incorrect and fails to decrement opened_channel_cnt when the URB allocation fails as reported in [1]. - even if opened_channel_cnt were to be correctly decremented, atomic_t is insufficient to cover edge cases: there can be a race condition in which 1/ a first process fails to allocate URBs memory 2/ a second process enters es58x_open() before the first process does its cleanup and decrements opened_channed_cnt. In which case, the second process would successfully return despite the URBs memory not being allocated. - actually, any kind of locking mechanism was useless here because it is redundant with the network stack big kernel lock (a.k.a. rtnl_lock) which is being hold by all the callers of net_device_ops:ndo_open() and net_device_ops:ndo_close(). c.f. the ASSERST_RTNL() calls in __dev_open() [2] and __dev_close_many() [3]. The atmomic_t is thus replaced by a simple u8 type and the logic to increment and decrement es58x_device:opened_channel_cnt is simplified accordingly fixing the bug reported in [1]. We do not check again for ASSERST_RTNL() as this is already done by the callers. [1] https://lore.kernel.org/linux-can/20220201140351.GA2548@kili/T/#u [2] https://elixir.bootlin.com/linux/v5.16/source/net/core/dev.c#L1463 [3] https://elixir.bootlin.com/linux/v5.16/source/net/core/dev.c#L1541 Fixes: 8537257874e9 ("can: etas_es58x: add core support for ETAS ES58X CAN USB interfaces") Link: https://lore.kernel.org/all/[email protected] Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Vincent Mailhol <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Vincent Mailhol <[email protected]> Date: Tue Feb 15 08:48:14 2022 +0900 can: gs_usb: change active_channels's type from atomic_t to u8 commit 035b0fcf02707d3c9c2890dc1484b11aa5335eb1 upstream. The driver uses an atomic_t variable: gs_usb:active_channels to keep track of the number of opened channels in order to only allocate memory for the URBs when this count changes from zero to one. However, the driver does not decrement the counter when an error occurs in gs_can_open(). This issue is fixed by changing the type from atomic_t to u8 and by simplifying the logic accordingly. It is safe to use an u8 here because the network stack big kernel lock (a.k.a. rtnl_mutex) is being hold. For details, please refer to [1]. [1] https://lore.kernel.org/linux-can/CAMZ6Rq+sHpiw34ijPsmp7vbUpDtJwvVtdV7CvRZJsLixjAFfrg@mail.gmail.com/T/#t Fixes: d08e973a77d1 ("can: gs_usb: Added support for the GS_USB CAN devices") Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Vincent Mailhol <[email protected]> Signed-off-by: Marc Kleine-Budde <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Ronnie Sahlberg <[email protected]> Date: Sat Feb 12 08:16:20 2022 +1000 cifs: do not use uninitialized data in the owner/group sid [ Upstream commit 26d3dadebbcbddfaf1d9caad42527a28a0ed28d8 ] When idsfromsid is used we create a special SID for owner/group. This structure must be initialized or else the first 5 bytes of the Authority field of the SID will contain uninitialized data and thus not be a valid SID. Signed-off-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ronnie Sahlberg <[email protected]> Date: Fri Feb 11 02:59:15 2022 +1000 cifs: fix double free race when mount fails in cifs_get_root() [ Upstream commit 3d6cc9898efdfb062efb74dc18cfc700e082f5d5 ] When cifs_get_root() fails during cifs_smb3_do_mount() we call deactivate_locked_super() which eventually will call delayed_free() which will free the context. In this situation we should not proceed to enter the out: section in cifs_smb3_do_mount() and free the same resources a second time. [Thu Feb 10 12:59:06 2022] BUG: KASAN: use-after-free in rcu_cblist_dequeue+0x32/0x60 [Thu Feb 10 12:59:06 2022] Read of size 8 at addr ffff888364f4d110 by task swapper/1/0 [Thu Feb 10 12:59:06 2022] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G OE 5.17.0-rc3+ #4 [Thu Feb 10 12:59:06 2022] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 [Thu Feb 10 12:59:06 2022] Call Trace: [Thu Feb 10 12:59:06 2022] <IRQ> [Thu Feb 10 12:59:06 2022] dump_stack_lvl+0x5d/0x78 [Thu Feb 10 12:59:06 2022] print_address_description.constprop.0+0x24/0x150 [Thu Feb 10 12:59:06 2022] ? rcu_cblist_dequeue+0x32/0x60 [Thu Feb 10 12:59:06 2022] kasan_report.cold+0x7d/0x117 [Thu Feb 10 12:59:06 2022] ? rcu_cblist_dequeue+0x32/0x60 [Thu Feb 10 12:59:06 2022] __asan_load8+0x86/0xa0 [Thu Feb 10 12:59:06 2022] rcu_cblist_dequeue+0x32/0x60 [Thu Feb 10 12:59:06 2022] rcu_core+0x547/0xca0 [Thu Feb 10 12:59:06 2022] ? call_rcu+0x3c0/0x3c0 [Thu Feb 10 12:59:06 2022] ? __this_cpu_preempt_check+0x13/0x20 [Thu Feb 10 12:59:06 2022] ? lock_is_held_type+0xea/0x140 [Thu Feb 10 12:59:06 2022] rcu_core_si+0xe/0x10 [Thu Feb 10 12:59:06 2022] __do_softirq+0x1d4/0x67b [Thu Feb 10 12:59:06 2022] __irq_exit_rcu+0x100/0x150 [Thu Feb 10 12:59:06 2022] irq_exit_rcu+0xe/0x30 [Thu Feb 10 12:59:06 2022] sysvec_hyperv_stimer0+0x9d/0xc0 ... [Thu Feb 10 12:59:07 2022] Freed by task 58179: [Thu Feb 10 12:59:07 2022] kasan_save_stack+0x26/0x50 [Thu Feb 10 12:59:07 2022] kasan_set_track+0x25/0x30 [Thu Feb 10 12:59:07 2022] kasan_set_free_info+0x24/0x40 [Thu Feb 10 12:59:07 2022] ____kasan_slab_free+0x137/0x170 [Thu Feb 10 12:59:07 2022] __kasan_slab_free+0x12/0x20 [Thu Feb 10 12:59:07 2022] slab_free_freelist_hook+0xb3/0x1d0 [Thu Feb 10 12:59:07 2022] kfree+0xcd/0x520 [Thu Feb 10 12:59:07 2022] cifs_smb3_do_mount+0x149/0xbe0 [cifs] [Thu Feb 10 12:59:07 2022] smb3_get_tree+0x1a0/0x2e0 [cifs] [Thu Feb 10 12:59:07 2022] vfs_get_tree+0x52/0x140 [Thu Feb 10 12:59:07 2022] path_mount+0x635/0x10c0 [Thu Feb 10 12:59:07 2022] __x64_sys_mount+0x1bf/0x210 [Thu Feb 10 12:59:07 2022] do_syscall_64+0x5c/0xc0 [Thu Feb 10 12:59:07 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae [Thu Feb 10 12:59:07 2022] Last potentially related work creation: [Thu Feb 10 12:59:07 2022] kasan_save_stack+0x26/0x50 [Thu Feb 10 12:59:07 2022] __kasan_record_aux_stack+0xb6/0xc0 [Thu Feb 10 12:59:07 2022] kasan_record_aux_stack_noalloc+0xb/0x10 [Thu Feb 10 12:59:07 2022] call_rcu+0x76/0x3c0 [Thu Feb 10 12:59:07 2022] cifs_umount+0xce/0xe0 [cifs] [Thu Feb 10 12:59:07 2022] cifs_kill_sb+0xc8/0xe0 [cifs] [Thu Feb 10 12:59:07 2022] deactivate_locked_super+0x5d/0xd0 [Thu Feb 10 12:59:07 2022] cifs_smb3_do_mount+0xab9/0xbe0 [cifs] [Thu Feb 10 12:59:07 2022] smb3_get_tree+0x1a0/0x2e0 [cifs] [Thu Feb 10 12:59:07 2022] vfs_get_tree+0x52/0x140 [Thu Feb 10 12:59:07 2022] path_mount+0x635/0x10c0 [Thu Feb 10 12:59:07 2022] __x64_sys_mount+0x1bf/0x210 [Thu Feb 10 12:59:07 2022] do_syscall_64+0x5c/0xc0 [Thu Feb 10 12:59:07 2022] entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: Shyam Prasad N <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Signed-off-by: Ronnie Sahlberg <[email protected]> Signed-off-by: Steve French <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ronnie Sahlberg <[email protected]> Date: Mon Feb 14 08:40:52 2022 +1000 cifs: modefromsids must add an ACE for authenticated users [ Upstream commit 0c6f4ebf8835d01866eb686d47578cde80097981 ] When we create a file with modefromsids we set an ACL that has one ACE for the magic modefromsid as well as a second ACE that grants full access to all authenticated users. When later we chante the mode on the file we strip away this, and other, ACE for authenticated users in set_chmod_dacl() and then just add back/update the modefromsid ACE. Thus leaving the file with a single ACE that is for the mode and no ACE to grant any user any rights to access the file. Fix this by always adding back also the modefromsid ACE so that we do not drop the rights to access the file. Signed-off-by: Ronnie Sahlberg <[email protected]> Reviewed-by: Shyam Prasad N <[email protected]> Signed-off-by: Steve French <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Yongzhi Liu <[email protected]> Date: Sat Jan 15 21:34:56 2022 -0800 dmaengine: shdma: Fix runtime PM imbalance on error [ Upstream commit 455896c53d5b803733ddd84e1bf8a430644439b6 ] pm_runtime_get_() increments the runtime PM usage counter even when it returns an error code, thus a matching decrement is needed on the error handling path to keep the counter balanced. Signed-off-by: Yongzhi Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Vinod Koul <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Leo (Hanghong) Ma <[email protected]> Date: Fri Nov 12 10:11:35 2021 -0500 drm/amd/display: Reduce dmesg error to a debug print commit 1d925758ba1a5d2716a847903e2fd04efcbd9862 upstream. [Why & How] Dmesg errors are found on dcn3.1 during reset test, but it's not a really failure. So reduce it to a debug print. Signed-off-by: Leo (Hanghong) Ma <[email protected]> Reviewed-by: Nicholas Kazlauskas <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: Mario Limonciello <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Evan Quan <[email protected]> Date: Tue Jan 18 14:07:51 2022 +0800 drm/amd/pm: correct UMD pstate clocks for Dimgrey Cavefish and Beige Goby [ Upstream commit 0136f5844b006e2286f873457c3fcba8c45a3735 ] Correct the UMD pstate profiling clocks for Dimgrey Cavefish and Beige Goby. Signed-off-by: Evan Quan <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Qiang Yu <[email protected]> Date: Mon Feb 21 17:53:56 2022 +0800 drm/amdgpu: check vm ready by amdgpu_vm->evicting flag [ Upstream commit c1a66c3bc425ff93774fb2f6eefa67b83170dd7e ] Workstation application ANSA/META v21.1.4 get this error dmesg when running CI test suite provided by ANSA/META: [drm:amdgpu_gem_va_ioctl [amdgpu]] *ERROR* Couldn't update BO_VA (-16) This is caused by: 1. create a 256MB buffer in invisible VRAM 2. CPU map the buffer and access it causes vm_fault and try to move it to visible VRAM 3. force visible VRAM space and traverse all VRAM bos to check if evicting this bo is valuable 4. when checking a VM bo (in invisible VRAM), amdgpu_vm_evictable() will set amdgpu_vm->evicting, but latter due to not in visible VRAM, won't really evict it so not add it to amdgpu_vm->evicted 5. before next CS to clear the amdgpu_vm->evicting, user VM ops ioctl will pass amdgpu_vm_ready() (check amdgpu_vm->evicted) but fail in amdgpu_vm_bo_update_mapping() (check amdgpu_vm->evicting) and get this error log This error won't affect functionality as next CS will finish the waiting VM ops. But we'd better clear the error log by checking the amdgpu_vm->evicting flag in amdgpu_vm_ready() to stop calling amdgpu_vm_bo_update_mapping() later. Another reason is amdgpu_vm->evicted list holds all BOs (both user buffer and page table), but only page table BOs' eviction prevent VM ops. amdgpu_vm->evicting flag is set only for page table BOs, so we should use evicting flag instead of evicted list in amdgpu_vm_ready(). The side effect of this change is: previously blocked VM op (user buffer in "evicted" list but no page table in it) gets done immediately. v2: update commit comments. Acked-by: Paul Menzel <[email protected]> Reviewed-by: Christian König <[email protected]> Signed-off-by: Qiang Yu <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Cc: [email protected] Signed-off-by: Sasha Levin <[email protected]>
Author: Qiang Yu <[email protected]> Date: Tue Mar 1 14:11:59 2022 +0800 drm/amdgpu: fix suspend/resume hang regression [ Upstream commit f1ef17011c765495c876fa75435e59eecfdc1ee4 ] Regression has been reported that suspend/resume may hang with the previous vm ready check commit. So bring back the evicted list check as a temp fix. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1922 Fixes: c1a66c3bc425 ("drm/amdgpu: check vm ready by amdgpu_vm->evicting flag") Reviewed-by: Christian König <[email protected]> Signed-off-by: Qiang Yu <[email protected]> Signed-off-by: Alex Deucher <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Douglas Anderson <[email protected]> Date: Tue Feb 22 14:18:43 2022 -0800 drm/bridge: ti-sn65dsi86: Properly undo autosuspend [ Upstream commit 26d3474348293dc752c55fe6d41282199f73714c ] The PM Runtime docs say: Drivers in ->remove() callback should undo the runtime PM changes done in ->probe(). Usually this means calling pm_runtime_disable(), pm_runtime_dont_use_autosuspend() etc. We weren't doing that for autosuspend. Let's do it. Fixes: 9bede63127c6 ("drm/bridge: ti-sn65dsi86: Use pm_runtime autosuspend") Signed-off-by: Douglas Anderson <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/20220222141838.1.If784ba19e875e8ded4ec4931601ce6d255845245@changeid Signed-off-by: Sasha Levin <[email protected]>
Author: Vinay Belgaumkar <[email protected]> Date: Wed Feb 16 10:15:04 2022 -0800 drm/i915/guc/slpc: Correct the param count for unset param [ Upstream commit 1b279f6ad467535c3b8a66b4edefaca2cdd5bdc3 ] SLPC unset param H2G only needs one parameter - the id of the param. Fixes: 025cb07bebfa ("drm/i915/guc/slpc: Cache platform frequency limits") Suggested-by: Umesh Nerlige Ramappa <[email protected]> Signed-off-by: Vinay Belgaumkar <[email protected]> Reviewed-by: Umesh Nerlige Ramappa <[email protected]> Signed-off-by: Ramalingam C <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 9648f1c3739505557d94ff749a4f32192ea81fe3) Signed-off-by: Tvrtko Ursulin <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ville Syrjälä <[email protected]> Date: Thu Feb 24 15:21:42 2022 +0200 drm/i915: s/JSP2/ICP2/ PCH commit 08783aa7693f55619859f4f63f384abf17cb58c5 upstream. This JSP2 PCH actually seems to be some special Apple specific ICP variant rather than a JSP. Make it so. Or at least all the references to it seem to be some Apple ICL machines. Didn't manage to find these PCI IDs in any public chipset docs unfortunately. The only thing we're losing here with this JSP->ICP change is Wa_14011294188, but based on the HSD that isn't actually needed on any ICP based design (including JSP), only TGP based stuff (including MCC) really need it. The documented w/a just never made that distinction because Windows didn't want to differentiate between JSP and MCC (not sure how they handle hpd/ddc/etc. then though...). Cc: [email protected] Cc: Matt Roper <[email protected]> Cc: Vivek Kasireddy <[email protected]> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/4226 Fixes: 943682e3bd19 ("drm/i915: Introduce Jasper Lake PCH") Signed-off-by: Ville Syrjälä <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Acked-by: Vivek Kasireddy <[email protected]> Tested-by: Tomas Bzatek <[email protected]> (cherry picked from commit 53581504a8e216d435f114a4f2596ad0dfd902fc) Signed-off-by: Tvrtko Ursulin <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sasha Neftin <[email protected]> Date: Thu Feb 3 14:21:49 2022 +0200 e1000e: Correct NVM checksum verification flow commit ffd24fa2fcc76ecb2e61e7a4ef8588177bcb42a6 upstream. Update MAC type check e1000_pch_tgp because for e1000_pch_cnp, NVM checksum update is still possible. Emit a more detailed warning message. Bugzilla: https://bugzilla.opensuse.org/show_bug.cgi?id=1191663 Fixes: 4051f68318ca ("e1000e: Do not take care about recovery NVM checksum") Reported-by: Thomas Bogendoerfer <[email protected]> Signed-off-by: Sasha Neftin <[email protected]> Tested-by: Naama Meir <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sasha Neftin <[email protected]> Date: Tue Jan 25 19:31:23 2022 +0200 e1000e: Fix possible HW unit hang after an s0ix exit [ Upstream commit 1866aa0d0d6492bc2f8d22d0df49abaccf50cddd ] Disable the OEM bit/Gig Disable/restart AN impact and disable the PHY LAN connected device (LCD) reset during power management flows. This fixes possible HW unit hangs on the s0ix exit on some corporate ADL platforms. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=214821 Fixes: 3e55d231716e ("e1000e: Add handshake with the CSME to support S0ix") Suggested-by: Dima Ruinskiy <[email protected]> Suggested-by: Nir Efrati <[email protected]> Signed-off-by: Sasha Neftin <[email protected]> Tested-by: Kai-Heng Feng <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Jann Horn <[email protected]> Date: Fri Feb 18 19:05:59 2022 +0100 efivars: Respect "block" flag in efivar_entry_set_safe() commit 258dd902022cb10c83671176688074879517fd21 upstream. When the "block" flag is false, the old code would sometimes still call check_var_size(), which wrongly tells ->query_variable_store() that it can block. As far as I can tell, this can't really materialize as a bug at the moment, because ->query_variable_store only does something on X86 with generic EFI, and in that configuration we always take the efivar_entry_set_nonblocking() path. Fixes: ca0e30dcaa53 ("efi: Add nonblocking option to efi_query_variable_store()") Signed-off-by: Jann Horn <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Christophe Vu-Brugier <[email protected]> Date: Mon Nov 22 22:02:37 2021 +0900 exfat: fix i_blocks for files truncated over 4 GiB [ Upstream commit 92fba084b79e6bc7b12fc118209f1922c1a2df56 ] In exfat_truncate(), the computation of inode->i_blocks is wrong if the file is larger than 4 GiB because a 32-bit variable is used as a mask. This is fixed and simplified by using round_up(). Also fix the same buggy computation in exfat_read_root() and another (correct) one in exfat_fill_inode(). The latter was fixed another way last month but can be simplified by using round_up() as well. See: commit 0c336d6e33f4 ("exfat: fix incorrect loading of i_blocks for large files") Fixes: 98d917047e8b ("exfat: add file operations") Cc: [email protected] # v5.7+ Suggested-by: Matthew Wilcox <[email protected]> Reviewed-by: Sungjong Seo <[email protected]> Signed-off-by: Christophe Vu-Brugier <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Christophe Vu-Brugier <[email protected]> Date: Tue Nov 2 22:23:58 2021 +0100 exfat: reuse exfat_inode_info variable instead of calling EXFAT_I() [ Upstream commit 7dee6f57d7f22a89dd214518c778aec448270d4c ] Also add a local "struct exfat_inode_info *ei" variable to exfat_truncate() to simplify the code. Signed-off-by: Christophe Vu-Brugier <[email protected]> Signed-off-by: Namjae Jeon <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Harshad Shirwadkar <[email protected]> Date: Thu Dec 23 12:21:38 2021 -0800 ext4: drop ineligible txn start stop APIs [ Upstream commit 7bbbe241ec7ce0def9f71464c878fdbd2b0dcf37 ] This patch drops ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() APIs. Fast commit ineligible transactions should simply call ext4_fc_mark_ineligible() after starting the trasaction. Signed-off-by: Harshad Shirwadkar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Xin Yin <[email protected]> Date: Mon Jan 17 17:36:55 2022 +0800 ext4: fast commit may miss file actions [ Upstream commit bdc8a53a6f2f0b1cb5f991440f2100732299eb93 ] in the follow scenario: 1. jbd start transaction n 2. task A get new handle for transaction n+1 3. task A do some actions and add inode to FC_Q_MAIN fc_q 4. jbd complete transaction n and clear FC_Q_MAIN fc_q 5. task A call fsync Fast commit will lost the file actions during a full commit. we should also add updates to staging queue during a full commit. and in ext4_fc_cleanup(), when reset a inode's fc track range, check it's i_sync_tid, if it bigger than current transaction tid, do not rest it, or we will lost the track range. And EXT4_MF_FC_COMMITTING is not needed anymore, so drop it. Signed-off-by: Xin Yin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Cc: [email protected] Signed-off-by: Sasha Levin <[email protected]>
Author: Xin Yin <[email protected]> Date: Mon Jan 17 17:36:54 2022 +0800 ext4: fast commit may not fallback for ineligible commit [ Upstream commit e85c81ba8859a4c839bcd69c5d83b32954133a5b ] For the follow scenario: 1. jbd start commit transaction n 2. task A get new handle for transaction n+1 3. task A do some ineligible actions and mark FC_INELIGIBLE 4. jbd complete transaction n and clean FC_INELIGIBLE 5. task A call fsync In this case fast commit will not fallback to full commit and transaction n+1 also not handled by jbd. Make ext4_fc_mark_ineligible() also record transaction tid for latest ineligible case, when call ext4_fc_cleanup() check current transaction tid, if small than latest ineligible tid do not clear the EXT4_MF_FC_INELIGIBLE. Reported-by: kernel test robot <[email protected]> Reported-by: Dan Carpenter <[email protected]> Reported-by: Ritesh Harjani <[email protected]> Suggested-by: Harshad Shirwadkar <[email protected]> Signed-off-by: Xin Yin <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Cc: [email protected] Signed-off-by: Sasha Levin <[email protected]>
Author: Harshad Shirwadkar <[email protected]> Date: Thu Dec 23 12:21:39 2021 -0800 ext4: simplify updating of fast commit stats [ Upstream commit 0915e464cb274648e1ef1663e1356e53ff400983 ] Move fast commit stats updating logic to a separate function from ext4_fc_commit(). This significantly improves readability of ext4_fc_commit(). Signed-off-by: Harshad Shirwadkar <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Theodore Ts'o <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Alyssa Ross <[email protected]> Date: Fri Feb 11 10:27:04 2022 +0000 firmware: arm_scmi: Remove space in MODULE_ALIAS name commit 1ba603f56568c3b4c2542dfba07afa25f21dcff3 upstream. modprobe can't handle spaces in aliases. Get rid of it to fix the issue. Link: https://lore.kernel.org/r/[email protected] Fixes: aa4f886f3893 ("firmware: arm_scmi: add basic driver infrastructure for SCMI") Reviewed-by: Cristian Marussi <[email protected]> Signed-off-by: Alyssa Ross <[email protected]> Signed-off-by: Sudeep Holla <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: William Mahon <[email protected]> Date: Thu Mar 3 18:26:22 2022 -0800 HID: add mapping for KEY_ALL_APPLICATIONS commit 327b89f0acc4c20a06ed59e4d9af7f6d804dc2e2 upstream. This patch adds a new key definition for KEY_ALL_APPLICATIONS and aliases KEY_DASHBOARD to it. It also maps the 0x0c/0x2a2 usage code to KEY_ALL_APPLICATIONS. Signed-off-by: William Mahon <[email protected]> Acked-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/20220303035618.1.I3a7746ad05d270161a18334ae06e3b6db1a1d339@changeid Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: William Mahon <[email protected]> Date: Thu Mar 3 18:23:42 2022 -0800 HID: add mapping for KEY_DICTATE commit bfa26ba343c727e055223be04e08f2ebdd43c293 upstream. Numerous keyboards are adding dictate keys which allows for text messages to be dictated by a microphone. This patch adds a new key definition KEY_DICTATE and maps 0x0c/0x0d8 usage code to this new keycode. Additionally hid-debug is adjusted to recognize this new usage code as well. Signed-off-by: William Mahon <[email protected]> Acked-by: Benjamin Tissoires <[email protected]> Link: https://lore.kernel.org/r/20220303021501.1.I5dbf50eb1a7a6734ee727bda4a8573358c6d3ec0@changeid Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Basavaraj Natikar <[email protected]> Date: Tue Feb 8 17:51:11 2022 +0530 HID: amd_sfh: Add functionality to clear interrupts [ Upstream commit fb75a3791a8032848c987db29b622878d8fe2b1c ] Newer AMD platforms with SFH may generate interrupts on some events which are unwarranted. Until this is cleared the actual MP2 data processing maybe stalled in some cases. Add a mechanism to clear the pending interrupts (if any) during the driver initialization and sensor command operations. Signed-off-by: Basavaraj Natikar <[email protected]> Signed-off-by: Jiri Kosina <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Basavaraj Natikar <[email protected]> Date: Tue Feb 8 17:51:12 2022 +0530 HID: amd_sfh: Add interrupt handler to process interrupts [ Upstream commit 7f016b35ca7623c71b31facdde080e8ce171a697 ] On newer AMD platforms with SFH, it is observed that random interrupts get generated on the SFH hardware and until this is cleared the firmware sensor processing is stalled, resulting in no data been received to driver side. Add routines to handle these interrupts, so that firmware operations are not stalled. Signed-off-by: Basavaraj Natikar <[email protected]> Signed-off-by: Jiri Kosina <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Basavaraj Natikar <[email protected]> Date: Tue Feb 8 17:51:08 2022 +0530 HID: amd_sfh: Handle amd_sfh work buffer in PM ops [ Upstream commit 0cf74235f4403b760a37f77271d2ca3424001ff9 ] Since in the current amd_sfh design the sensor data is periodically obtained in the form of poll data, during the suspend/resume cycle, scheduling a delayed work adds no value. So, cancel the work and restart back during the suspend/resume cycle respectively. Signed-off-by: Basavaraj Natikar <[email protected]> Signed-off-by: Jiri Kosina <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Eric Anholt <[email protected]> Date: Fri Feb 23 22:42:31 2018 +0100 i2c: bcm2835: Avoid clock stretching timeouts [ Upstream commit 9495b9b31abe525ebd93da58de2c88b9f66d3a0e ] The CLKT register contains at poweron 0x40, which at our typical 100kHz bus rate means .64ms. But there is no specified limit to how long devices should be able to stretch the clocks, so just disable the timeout. We still have a timeout wrapping the entire transfer. Signed-off-by: Eric Anholt <[email protected]> Signed-off-by: Stefan Wahren <[email protected]> BugLink: https://github.com/raspberrypi/linux/issues/3064 Signed-off-by: Wolfram Sang <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Wolfram Sang <[email protected]> Date: Sat Feb 12 20:45:48 2022 +0100 i2c: cadence: allow COMPILE_TEST [ Upstream commit 0b0dcb3882c8f08bdeafa03adb4487e104d26050 ] Driver builds fine with COMPILE_TEST. Enable it for wider test coverage and easier maintenance. Signed-off-by: Wolfram Sang <[email protected]> Acked-by: Michal Simek <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Wolfram Sang <[email protected]> Date: Sat Feb 12 20:46:57 2022 +0100 i2c: imx: allow COMPILE_TEST [ Upstream commit 2ce4462f2724d1b3cedccea441c6d18bb360629a ] Driver builds fine with COMPILE_TEST. Enable it for wider test coverage and easier maintenance. Signed-off-by: Wolfram Sang <[email protected]> Acked-by: Oleksij Rempel <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Wolfram Sang <[email protected]> Date: Sat Feb 12 20:47:07 2022 +0100 i2c: qup: allow COMPILE_TEST [ Upstream commit 5de717974005fcad2502281e9f82e139ca91f4bb ] Driver builds fine with COMPILE_TEST. Enable it for wider test coverage and easier maintenance. Signed-off-by: Wolfram Sang <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Jedrzej Jagielski <[email protected]> Date: Tue Jun 22 15:43:48 2021 +0200 iavf: Add trace while removing device [ Upstream commit bdb9e5c7aec73a7b8b5acab37587b6de1203e68d ] Add kernel trace that device was removed. Currently there is no such information. I.e. Host admin removes a PCI device from a VM, than on VM shall be info about the event. This patch adds info log to iavf_remove function. Signed-off-by: Arkadiusz Kubalewski <[email protected]> Signed-off-by: Jedrzej Jagielski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:36:56 2022 +0100 iavf: Add waiting so the port is initialized in remove [ Upstream commit 974578017fc1fdd06cea8afb9dfa32602e8529ed ] There exist races when port is being configured and remove is triggered. unregister_netdev is not and can't be called under crit_lock mutex since it is calling ndo_stop -> iavf_close which requires this lock. Depending on init state the netdev could be still unregistered so unregister_netdev never cleans up, when shortly after that the device could become registered. Make iavf_remove wait until port finishes initialization. All critical state changes are atomic (under crit_lock). Crashes that come from iavf_reset_interrupt_capability and iavf_free_traffic_irqs should now be solved in a graceful manner. Fixes: 605ca7c5c6707 ("iavf: Fix kernel BUG in free_msi_irqs") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:38:55 2022 +0100 iavf: Fix __IAVF_RESETTING state usage [ Upstream commit 14756b2ae265d526b8356e86729090b01778fdf6 ] The setup of __IAVF_RESETTING state in watchdog task had no effect and could lead to slow resets in the driver as the task for __IAVF_RESETTING state only requeues watchdog. Till now the __IAVF_RESETTING was interpreted by reset task as running state which could lead to errors with allocating and resources disposal. Make watchdog_task queue the reset task when it's necessary. Do not update the state to __IAVF_RESETTING so the reset task knows exactly what is the current state of the adapter. Fixes: 898ef1cb1cb2 ("iavf: Combine init and watchdog state machines") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:38:31 2022 +0100 iavf: Fix deadlock in iavf_reset_task commit e85ff9c631e1bf109ce8428848dfc8e8b0041f48 upstream. There exists a missing mutex_unlock call on crit_lock in iavf_reset_task call path. Unlock the crit_lock before returning from reset task. Fixes: 5ac49f3c2702 ("iavf: use mutexes for locking of critical sections") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:37:10 2022 +0100 iavf: Fix init state closure on remove [ Upstream commit 3ccd54ef44ebfa0792c5441b6d9c86618f3378d1 ] When init states of the adapter work, the errors like lack of communication with the PF might hop in. If such events occur the driver restores previous states in order to retry initialization in a proper way. When remove task kicks in, this situation could lead to races with unregistering the netdevice as well as resources cleanup. With the commit introducing the waiting in remove for init to complete, this problem turns into an endless waiting if init never recovers from errors. Introduce __IAVF_IN_REMOVE_TASK bit to indicate that the remove thread has started. Make __IAVF_COMM_FAILED adapter state respect the __IAVF_IN_REMOVE_TASK bit and set the __IAVF_INIT_FAILED state and return without any action instead of trying to recover. Make __IAVF_INIT_FAILED adapter state respect the __IAVF_IN_REMOVE_TASK bit and return without any further actions. Make the loop in the remove handler break when adapter has __IAVF_INIT_FAILED state set. Fixes: 898ef1cb1cb2 ("iavf: Combine init and watchdog state machines") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:37:50 2022 +0100 iavf: Fix locking for VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS [ Upstream commit 0579fafd37fb7efe091f0e6c8ccf968864f40f3e ] iavf_virtchnl_completion is called under crit_lock but when the code for VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS is called, this lock is released in order to obtain rtnl_lock to avoid ABBA deadlock with unregister_netdev. Along with the new way iavf_remove behaves, there exist many risks related to the lock release and attmepts to regrab it. The driver faces crashes related to races between unregister_netdev and netdev_update_features. Yet another risk is that the driver could already obtain the crit_lock in order to destroy it and iavf_virtchnl_completion could crash or block forever. Make iavf_virtchnl_completion never relock crit_lock in it's call paths. Extract rtnl_lock locking logic to the driver for unregister_netdev in order to set the netdev_registered flag inside the lock. Introduce a new flag that will inform adminq_task to perform the code from VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS right after it finishes processing messages. Guard this code with remove flags so it's never called when the driver is in remove state. Fixes: 5951a2b9812d ("iavf: Fix VLAN feature flags after VFR") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:38:43 2022 +0100 iavf: Fix missing check for running netdev commit d2c0f45fcceb0995f208c441d9c9a453623f9ccf upstream. The driver was queueing reset_task regardless of the netdev state. Do not queue the reset task in iavf_change_mtu if netdev is not running. Fixes: fdd4044ffdc8 ("iavf: Remove timer for work triggering, use delaying work instead") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:38:01 2022 +0100 iavf: Fix race in init state [ Upstream commit a472eb5cbaebb5774672c565e024336c039e9128 ] When iavf_init_version_check sends VIRTCHNL_OP_GET_VF_RESOURCES message, the driver will wait for the response after requeueing the watchdog task in iavf_init_get_resources call stack. The logic is implemented this way that iavf_init_get_resources has to be called in order to allocate adapter->vf_res. It is polling for the AQ response in iavf_get_vf_config function. Expect a call trace from kernel when adminq_task worker handles this message first. adapter->vf_res will be NULL in iavf_virtchnl_completion. Make the watchdog task not queue the adminq_task if the init process is not finished yet. Fixes: 898ef1cb1cb2 ("iavf: Combine init and watchdog state machines") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Slawomir Laba <[email protected]> Date: Wed Feb 23 13:35:49 2022 +0100 iavf: Rework mutexes for better synchronisation [ Upstream commit fc2e6b3b132a907378f6af08356b105a4139c4fb ] The driver used to crash in multiple spots when put to stress testing of the init, reset and remove paths. The user would experience call traces or hangs when creating, resetting, removing VFs. Depending on the machines, the call traces are happening in random spots, like reset restoring resources racing with driver remove. Make adapter->crit_lock mutex a mandatory lock for guarding the operations performed on all workqueues and functions dealing with resource allocation and disposal. Make __IAVF_REMOVE a final state of the driver respected by workqueues that shall not requeue, when they fail to obtain the crit_lock. Make the IRQ handler not to queue the new work for adminq_task when the __IAVF_REMOVE state is set. Fixes: 5ac49f3c2702 ("iavf: use mutexes for locking of critical sections") Signed-off-by: Slawomir Laba <[email protected]> Signed-off-by: Phani Burra <[email protected]> Signed-off-by: Jacob Keller <[email protected]> Signed-off-by: Mateusz Palczewski <[email protected]> Tested-by: Konrad Jankowski <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:58 2022 -0800 ibmvnic: Allow queueing resets during probe [ Upstream commit fd98693cb0721317f27341951593712c580c36a1 ] We currently don't allow queuing resets when adapter is in VNIC_PROBING state - instead we throw away the reset and return EBUSY. The reasoning is probably that during ibmvnic_probe() the ibmvnic_adapter itself is being initialized so performing a reset during this time can lead us to accessing fields in the ibmvnic_adapter that are not fully initialized. A review of the code shows that all the adapter state neede to process a reset is initialized before registering the CRQ so that should no longer be a concern. Further the expectation is that if we do get a reset (transport event) during probe, the do..while() loop in ibmvnic_probe() will handle this by reinitializing the CRQ. While that is true to some extent, it is possible that the reset might occur _after_ the CRQ is registered and CRQ_INIT message was exchanged but _before_ the adapter state is set to VNIC_PROBED. As mentioned above, such a reset will be thrown away. While the client assumes that the adapter is functional, the vnic server will wait for the client to reinit the adapter. This disconnect between the two leaves the adapter down needing manual intervention. Because ibmvnic_probe() has other work to do after initializing the CRQ (such as registering the netdev at a minimum) and because the reset event can occur at any instant after the CRQ is initialized, there will always be a window between initializing the CRQ and considering the adapter ready for resets (ie state == PROBED). So rather than discarding resets during this window, allow queueing them - but only process them after the adapter is fully initialized. To do this, introduce a new completion state ->probe_done and have the reset worker thread wait on this before processing resets. This change brings up two new situations in or just after ibmvnic_probe(). First after one or more resets were queued, we encounter an error and decide to retry the initialization. At that point the queued resets are no longer relevant since we could be talking to a new vnic server. So we must purge/flush the queued resets before restarting the initialization. As a side note, since we are still in the probing stage and we have not registered the netdev, it will not be CHANGE_PARAM reset. Second this change opens up a potential race between the worker thread in __ibmvnic_reset(), the tasklet and the ibmvnic_open() due to the following sequence of events: 1. Register CRQ 2. Get transport event before CRQ_INIT completes. 3. Tasklet schedules reset: a) add rwi to list b) schedule_work() to start worker thread which runs and waits for ->probe_done. 4. ibmvnic_probe() decides to retry, purges rwi_list 5. Re-register crq and this time rest of probe succeeds - register netdev and complete(->probe_done). 6. Worker thread resumes in __ibmvnic_reset() from 3b. 7. Worker thread sets ->resetting bit 8. ibmvnic_open() comes in, notices ->resetting bit, sets state to IBMVNIC_OPEN and returns early expecting worker thread to finish the open. 9. Worker thread finds rwi_list empty and returns without opening the interface. If this happens, the ->ndo_open() call is effectively lost and the interface remains down. To address this, ensure that ->rwi_list is not empty before setting the ->resetting bit. See also comments in __ibmvnic_reset(). Fixes: 6a2fb0e99f9c ("ibmvnic: driver initialization for kdump/kexec") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:57 2022 -0800 ibmvnic: clear fop when retrying probe [ Upstream commit f628ad531b4f34fdba0984255b4a2850dd369513 ] Clear ->failover_pending flag that may have been set in the previous pass of registering CRQ. If we don't clear, a subsequent ibmvnic_open() call would be misled into thinking a failover is pending and assuming that the reset worker thread would open the adapter. If this pass of registering the CRQ succeeds (i.e there is no transport event), there wouldn't be a reset worker thread. This would leave the adapter unconfigured and require manual intervention to bring it up during boot. Fixes: 5a18e1e0c193 ("ibmvnic: Fix failover case for non-redundant configuration") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:54 2022 -0800 ibmvnic: complete init_done on transport events [ Upstream commit 36491f2df9ad2501e5a4ec25d3d95d72bafd2781 ] If we get a transport event, set the error and mark the init as complete so the attempt to send crq-init or login fail sooner rather than wait for the timeout. Fixes: bbd669a868bb ("ibmvnic: Fix completion structure initialization") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:53 2022 -0800 ibmvnic: define flush_reset_queue helper [ Upstream commit 83da53f7e4bd86dca4b2edc1e2bb324fb3c033a1 ] Define and use a helper to flush the reset queue. Fixes: 2770a7984db5 ("ibmvnic: Introduce hard reset recovery") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:51 2022 -0800 ibmvnic: free reset-work-item when flushing commit 8d0657f39f487d904fca713e0bc39c2707382553 upstream. Fix a tiny memory leak when flushing the reset work queue. Fixes: 2770a7984db5 ("ibmvnic: Introduce hard reset recovery") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:56 2022 -0800 ibmvnic: init init_done_rc earlier [ Upstream commit ae16bf15374d8b055e040ac6f3f1147ab1c9bb7d ] We currently initialize the ->init_done completion/return code fields before issuing a CRQ_INIT command. But if we get a transport event soon after registering the CRQ the taskslet may already have recorded the completion and error code. If we initialize here, we might overwrite/ lose that and end up issuing the CRQ_INIT only to timeout later. If that timeout happens during probe, we will leave the adapter in the DOWN state rather than retrying to register/init the CRQ. Initialize the completion before registering the CRQ so we don't lose the notification. Fixes: 032c5e82847a ("Driver for IBM System i/p VNIC protocol") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:52 2022 -0800 ibmvnic: initialize rc before completing wait [ Upstream commit 765559b10ce514eb1576595834f23cdc92125fee ] We should initialize ->init_done_rc before calling complete(). Otherwise the waiting thread may see ->init_done_rc as 0 before we have updated it and may assume that the CRQ was successful. Fixes: 6b278c0cb378 ("ibmvnic delay complete()") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sukadev Bhattiprolu <[email protected]> Date: Thu Feb 24 22:23:55 2022 -0800 ibmvnic: register netdev after init of adapter commit 570425f8c7c18b14fa8a2a58a0adb431968ad118 upstream. Finish initializing the adapter before registering netdev so state is consistent. Fixes: c26eba03e407 ("ibmvnic: Update reset infrastructure to support tunable parameters") Signed-off-by: Sukadev Bhattiprolu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Dany Madden <[email protected]> Date: Tue Dec 14 00:17:47 2021 -0500 ibmvnic: Update driver return codes [ Upstream commit b6ee566cf3940883d67c0d142fae8d410e975f47 ] Update return codes to be more informative. Signed-off-by: Jacob Root <[email protected]> Signed-off-by: Dany Madden <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Corinna Vinschen <[email protected]> Date: Wed Feb 16 14:31:35 2022 +0100 igc: igc_read_phy_reg_gpy: drop premature return commit fda2635466cd26ad237e1bc5d3f6a60f97ad09b6 upstream. igc_read_phy_reg_gpy checks the return value from igc_read_phy_reg_mdic and if it's not 0, returns immediately. By doing this, it leaves the HW semaphore in the acquired state. Drop this premature return statement, the function returns after releasing the semaphore immediately anyway. Fixes: 5586838fe9ce ("igc: Add code for PHY support") Signed-off-by: Corinna Vinschen <[email protected]> Acked-by: Sasha Neftin <[email protected]> Tested-by: Naama Meir <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sasha Neftin <[email protected]> Date: Sun Feb 20 09:29:15 2022 +0200 igc: igc_write_phy_reg_gpy: drop premature return commit c4208653a327a09da1e9e7b10299709b6d9b17bf upstream. Similar to "igc_read_phy_reg_gpy: drop premature return" patch. igc_write_phy_reg_gpy checks the return value from igc_write_phy_reg_mdic and if it's not 0, returns immediately. By doing this, it leaves the HW semaphore in the acquired state. Drop this premature return statement, the function returns after releasing the semaphore immediately anyway. Fixes: 5586838fe9ce ("igc: Add code for PHY support") Suggested-by: Dima Ruinskiy <[email protected]> Reported-by: Corinna Vinschen <[email protected]> Signed-off-by: Sasha Neftin <[email protected]> Tested-by: Naama Meir <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: José Expósito <[email protected]> Date: Tue Feb 8 09:59:16 2022 -0800 Input: clear BTN_RIGHT/MIDDLE on buttonpads [ Upstream commit 37ef4c19b4c659926ce65a7ac709ceaefb211c40 ] Buttonpads are expected to map the INPUT_PROP_BUTTONPAD property bit and the BTN_LEFT key bit. As explained in the specification, where a device has a button type value of 0 (click-pad) or 1 (pressure-pad) there should not be discrete buttons: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/touchpad-windows-precision-touchpad-collection#device-capabilities-feature-report However, some drivers map the BTN_RIGHT and/or BTN_MIDDLE key bits even though the device is a buttonpad and therefore does not have those buttons. This behavior has forced userspace applications like libinput to implement different workarounds and quirks to detect buttonpads and offer to the user the right set of features and configuration options. For more information: https://gitlab.freedesktop.org/libinput/libinput/-/merge_requests/726 In order to avoid this issue clear the BTN_RIGHT and BTN_MIDDLE key bits when the input device is register if the INPUT_PROP_BUTTONPAD property bit is set. Notice that this change will not affect udev because it does not check for buttons. See systemd/src/udev/udev-builtin-input_id.c. List of known affected hardware: - Chuwi AeroBook Plus - Chuwi Gemibook - Framework Laptop - GPD Win Max - Huawei MateBook 2020 - Prestigio Smartbook 141 C2 - Purism Librem 14v1 - StarLite Mk II - AMI firmware - StarLite Mk II - Coreboot firmware - StarLite Mk III - AMI firmware - StarLite Mk III - Coreboot firmware - StarLabTop Mk IV - AMI firmware - StarLabTop Mk IV - Coreboot firmware - StarBook Mk V Acked-by: Peter Hutterer <[email protected]> Acked-by: Benjamin Tissoires <[email protected]> Acked-by: Jiri Kosina <[email protected]> Signed-off-by: José Expósito <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Hans de Goede <[email protected]> Date: Mon Feb 28 23:39:50 2022 -0800 Input: elan_i2c - fix regulator enable count imbalance after suspend/resume commit 04b7762e37c95d9b965d16bb0e18dbd1fa2e2861 upstream. Before these changes elan_suspend() would only disable the regulator when device_may_wakeup() returns false; whereas elan_resume() would unconditionally enable it, leading to an enable count imbalance when device_may_wakeup() returns true. This triggers the "WARN_ON(regulator->enable_count)" in regulator_put() when the elan_i2c driver gets unbound, this happens e.g. with the hot-plugable dock with Elan I2C touchpad for the Asus TF103C 2-in-1. Fix this by making the regulator_enable() call also be conditional on device_may_wakeup() returning false. Signed-off-by: Hans de Goede <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Hans de Goede <[email protected]> Date: Mon Feb 28 23:39:38 2022 -0800 Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power() commit 81a36d8ce554b82b0a08e2b95d0bd44fcbff339b upstream. elan_disable_power() is called conditionally on suspend, where as elan_enable_power() is always called on resume. This leads to an imbalance in the regulator's enable count. Move the regulator_[en|dis]able() calls out of elan_[en|dis]able_power() in preparation of fixing this. No functional changes intended. Signed-off-by: Hans de Goede <[email protected]> Link: https://lore.kernel.org/r/[email protected] [dtor: consolidate elan_[en|dis]able() into elan_set_power()] Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: David Gow <[email protected]> Date: Sun Feb 27 21:00:10 2022 -0800 Input: samsung-keypad - properly state IOMEM dependency commit ba115adf61b36b8c167126425a62b0efc23f72c0 upstream. Make the samsung-keypad driver explicitly depend on CONFIG_HAS_IOMEM, as it calls devm_ioremap(). This prevents compile errors in some configs (e.g, allyesconfig/randconfig under UML): /usr/bin/ld: drivers/input/keyboard/samsung-keypad.o: in function `samsung_keypad_probe': samsung-keypad.c:(.text+0xc60): undefined reference to `devm_ioremap' Signed-off-by: David Gow <[email protected]> Acked-by: anton ivanov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Dmitry Torokhov <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Suravee Suthikulpanit <[email protected]> Date: Thu Feb 10 09:47:45 2022 -0600 iommu/amd: Fix I/O page table memory leak [ Upstream commit 6b0b2d9a6a308bcd9300c2d83000a82812c56cea ] The current logic updates the I/O page table mode for the domain before calling the logic to free memory used for the page table. This results in IOMMU page table memory leak, and can be observed when launching VM w/ pass-through devices. Fix by freeing the memory used for page table before updating the mode. Cc: Joerg Roedel <[email protected]> Reported-by: Daniel Jordan <[email protected]> Tested-by: Daniel Jordan <[email protected]> Signed-off-by: Suravee Suthikulpanit <[email protected]> Fixes: e42ba0633064 ("iommu/amd: Restructure code for freeing page table") Link: https://lore.kernel.org/all/[email protected]/ Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Lennert Buytenhek <[email protected]> Date: Mon Oct 4 13:07:24 2021 +0300 iommu/amd: Recover from event log overflow commit 5ce97f4ec5e0f8726a5dda1710727b1ee9badcac upstream. The AMD IOMMU logs I/O page faults and such to a ring buffer in system memory, and this ring buffer can overflow. The AMD IOMMU spec has the following to say about the interrupt status bit that signals this overflow condition: EventOverflow: Event log overflow. RW1C. Reset 0b. 1 = IOMMU event log overflow has occurred. This bit is set when a new event is to be written to the event log and there is no usable entry in the event log, causing the new event information to be discarded. An interrupt is generated when EventOverflow = 1b and MMIO Offset 0018h[EventIntEn] = 1b. No new event log entries are written while this bit is set. Software Note: To resume logging, clear EventOverflow (W1C), and write a 1 to MMIO Offset 0018h[EventLogEn]. The AMD IOMMU driver doesn't currently implement this recovery sequence, meaning that if a ring buffer overflow occurs, logging of EVT/PPR/GA events will cease entirely. This patch implements the spec-mandated reset sequence, with the minor tweak that the hardware seems to want to have a 0 written to MMIO Offset 0018h[EventLogEn] first, before writing an 1 into this field, or the IOMMU won't actually resume logging events. Signed-off-by: Lennert Buytenhek <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Miaoqian Lin <[email protected]> Date: Fri Jan 7 08:09:11 2022 +0000 iommu/tegra-smmu: Fix missing put_device() call in tegra_smmu_find commit 9826e393e4a8c3df474e7f9eacd3087266f74005 upstream. The reference taken by 'of_find_device_by_node()' must be released when not needed anymore. Add the corresponding 'put_device()' in the error handling path. Fixes: 765a9d1d02b2 ("iommu/tegra-smmu: Fix mc errors on tegra124-nyan") Signed-off-by: Miaoqian Lin <[email protected]> Acked-by: Thierry Reding <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Adrian Huang <[email protected]> Date: Mon Feb 21 13:33:48 2022 +0800 iommu/vt-d: Fix double list_add when enabling VMD in scalable mode commit b00833768e170a31af09268f7ab96aecfcca9623 upstream. When enabling VMD and IOMMU scalable mode, the following kernel panic call trace/kernel log is shown in Eagle Stream platform (Sapphire Rapids CPU) during booting: pci 0000:59:00.5: Adding to iommu group 42 ... vmd 0000:59:00.5: PCI host bridge to bus 10000:80 pci 10000:80:01.0: [8086:352a] type 01 class 0x060400 pci 10000:80:01.0: reg 0x10: [mem 0x00000000-0x0001ffff 64bit] pci 10000:80:01.0: enabling Extended Tags pci 10000:80:01.0: PME# supported from D0 D3hot D3cold pci 10000:80:01.0: DMAR: Setup RID2PASID failed pci 10000:80:01.0: Failed to add to iommu group 42: -16 pci 10000:80:03.0: [8086:352b] type 01 class 0x060400 pci 10000:80:03.0: reg 0x10: [mem 0x00000000-0x0001ffff 64bit] pci 10000:80:03.0: enabling Extended Tags pci 10000:80:03.0: PME# supported from D0 D3hot D3cold ------------[ cut here ]------------ kernel BUG at lib/list_debug.c:29! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.17.0-rc3+ #7 Hardware name: Lenovo ThinkSystem SR650V3/SB27A86647, BIOS ESE101Y-1.00 01/13/2022 Workqueue: events work_for_cpu_fn RIP: 0010:__list_add_valid.cold+0x26/0x3f Code: 9a 4a ab ff 4c 89 c1 48 c7 c7 40 0c d9 9e e8 b9 b1 fe ff 0f 0b 48 89 f2 4c 89 c1 48 89 fe 48 c7 c7 f0 0c d9 9e e8 a2 b1 fe ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 98 0c d9 9e e8 8b b1 fe RSP: 0000:ff5ad434865b3a40 EFLAGS: 00010246 RAX: 0000000000000058 RBX: ff4d61160b74b880 RCX: ff4d61255e1fffa8 RDX: 0000000000000000 RSI: 00000000fffeffff RDI: ffffffff9fd34f20 RBP: ff4d611d8e245c00 R08: 0000000000000000 R09: ff5ad434865b3888 R10: ff5ad434865b3880 R11: ff4d61257fdc6fe8 R12: ff4d61160b74b8a0 R13: ff4d61160b74b8a0 R14: ff4d611d8e245c10 R15: ff4d611d8001ba70 FS: 0000000000000000(0000) GS:ff4d611d5ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ff4d611fa1401000 CR3: 0000000aa0210001 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> intel_pasid_alloc_table+0x9c/0x1d0 dmar_insert_one_dev_info+0x423/0x540 ? device_to_iommu+0x12d/0x2f0 intel_iommu_attach_device+0x116/0x290 __iommu_attach_device+0x1a/0x90 iommu_group_add_device+0x190/0x2c0 __iommu_probe_device+0x13e/0x250 iommu_probe_device+0x24/0x150 iommu_bus_notifier+0x69/0x90 blocking_notifier_call_chain+0x5a/0x80 device_add+0x3db/0x7b0 ? arch_memremap_can_ram_remap+0x19/0x50 ? memremap+0x75/0x140 pci_device_add+0x193/0x1d0 pci_scan_single_device+0xb9/0xf0 pci_scan_slot+0x4c/0x110 pci_scan_child_bus_extend+0x3a/0x290 vmd_enable_domain.constprop.0+0x63e/0x820 vmd_probe+0x163/0x190 local_pci_probe+0x42/0x80 work_for_cpu_fn+0x13/0x20 process_one_work+0x1e2/0x3b0 worker_thread+0x1c4/0x3a0 ? rescuer_thread+0x370/0x370 kthread+0xc7/0xf0 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- ... Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ca00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) ---[ end Kernel panic - not syncing: Fatal exception ]--- The following 'lspci' output shows devices '10000:80:*' are subdevices of the VMD device 0000:59:00.5: $ lspci ... 0000:59:00.5 RAID bus controller: Intel Corporation Volume Management Device NVMe RAID Controller (rev 20) ... 10000:80:01.0 PCI bridge: Intel Corporation Device 352a (rev 03) 10000:80:03.0 PCI bridge: Intel Corporation Device 352b (rev 03) 10000:80:05.0 PCI bridge: Intel Corporation Device 352c (rev 03) 10000:80:07.0 PCI bridge: Intel Corporation Device 352d (rev 03) 10000:81:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] 10000:82:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND, Beta Rock Controller] The symptom 'list_add double add' is caused by the following failure message: pci 10000:80:01.0: DMAR: Setup RID2PASID failed pci 10000:80:01.0: Failed to add to iommu group 42: -16 pci 10000:80:03.0: [8086:352b] type 01 class 0x060400 Device 10000:80:01.0 is the subdevice of the VMD device 0000:59:00.5, so invoking intel_pasid_alloc_table() gets the pasid_table of the VMD device 0000:59:00.5. Here is call path: intel_pasid_alloc_table pci_for_each_dma_alias get_alias_pasid_table search_pasid_table pci_real_dma_dev() in pci_for_each_dma_alias() gets the real dma device which is the VMD device 0000:59:00.5. However, pte of the VMD device 0000:59:00.5 has been configured during this message "pci 0000:59:00.5: Adding to iommu group 42". So, the status -EBUSY is returned when configuring pasid entry for device 10000:80:01.0. It then invokes dmar_remove_one_dev_info() to release 'struct device_domain_info *' from iommu_devinfo_cache. But, the pasid table is not released because of the following statement in __dmar_remove_one_dev_info(): if (info->dev && !dev_is_real_dma_subdevice(info->dev)) { ... intel_pasid_free_table(info->dev); } The subsequent dmar_insert_one_dev_info() operation of device 10000:80:03.0 allocates 'struct device_domain_info *' from iommu_devinfo_cache. The allocated address is the same address that is released previously for device 10000:80:01.0. Finally, invoking device_attach_pasid_table() causes the issue. `git bisect` points to the offending commit 474dd1c65064 ("iommu/vt-d: Fix clearing real DMA device's scalable-mode context entries"), which releases the pasid table if the device is not the subdevice by checking the returned status of dev_is_real_dma_subdevice(). Reverting the offending commit can work around the issue. The solution is to prevent from allocating pasid table if those devices are subdevices of the VMD device. Fixes: 474dd1c65064 ("iommu/vt-d: Fix clearing real DMA device's scalable-mode context entries") Cc: [email protected] # v5.14+ Signed-off-by: Adrian Huang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Eric Dumazet <[email protected]> Date: Thu Mar 3 09:37:28 2022 -0800 ipv6: fix skb drops in igmp6_event_query() and igmp6_event_report() [ Upstream commit 2d3916f3189172d5c69d33065c3c21119fe539fc ] While investigating on why a synchronize_net() has been added recently in ipv6_mc_down(), I found that igmp6_event_query() and igmp6_event_report() might drop skbs in some cases. Discussion about removing synchronize_net() from ipv6_mc_down() will happen in a different thread. Fixes: f185de28d9ae ("mld: add new workqueues for process mld events") Signed-off-by: Eric Dumazet <[email protected]> Cc: Taehee Yoo <[email protected]> Cc: Cong Wang <[email protected]> Cc: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Randy Dunlap <[email protected]> Date: Tue Feb 22 19:06:30 2022 -0800 iwlwifi: mvm: check debugfs_dir ptr before use commit 5a6248c0a22352f09ea041665d3bd3e18f6f872c upstream. When "debugfs=off" is used on the kernel command line, iwiwifi's mvm module uses an invalid/unchecked debugfs_dir pointer and causes a BUG: BUG: kernel NULL pointer dereference, address: 000000000000004f #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP CPU: 1 PID: 503 Comm: modprobe Tainted: G W 5.17.0-rc5 #7 Hardware name: Dell Inc. Inspiron 15 5510/076F7Y, BIOS 2.4.1 11/05/2021 RIP: 0010:iwl_mvm_dbgfs_register+0x692/0x700 [iwlmvm] Code: 69 a0 be 80 01 00 00 48 c7 c7 50 73 6a a0 e8 95 cf ee e0 48 8b 83 b0 1e 00 00 48 c7 c2 54 73 6a a0 be 64 00 00 00 48 8d 7d 8c <48> 8b 48 50 e8 15 22 07 e1 48 8b 43 28 48 8d 55 8c 48 c7 c7 5f 73 RSP: 0018:ffffc90000a0ba68 EFLAGS: 00010246 RAX: ffffffffffffffff RBX: ffff88817d6e3328 RCX: ffff88817d6e3328 RDX: ffffffffa06a7354 RSI: 0000000000000064 RDI: ffffc90000a0ba6c RBP: ffffc90000a0bae0 R08: ffffffff824e4880 R09: ffffffffa069d620 R10: ffffc90000a0ba00 R11: ffffffffffffffff R12: 0000000000000000 R13: ffffc90000a0bb28 R14: ffff88817d6e3328 R15: ffff88817d6e3320 FS: 00007f64dd92d740(0000) GS:ffff88847f640000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000004f CR3: 000000016fc79001 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> ? iwl_mvm_mac_setup_register+0xbdc/0xda0 [iwlmvm] iwl_mvm_start_post_nvm+0x71/0x100 [iwlmvm] iwl_op_mode_mvm_start+0xab8/0xb30 [iwlmvm] _iwl_op_mode_start+0x6f/0xd0 [iwlwifi] iwl_opmode_register+0x6a/0xe0 [iwlwifi] ? 0xffffffffa0231000 iwl_mvm_init+0x35/0x1000 [iwlmvm] ? 0xffffffffa0231000 do_one_initcall+0x5a/0x1b0 ? kmem_cache_alloc+0x1e5/0x2f0 ? do_init_module+0x1e/0x220 do_init_module+0x48/0x220 load_module+0x2602/0x2bc0 ? __kernel_read+0x145/0x2e0 ? kernel_read_file+0x229/0x290 __do_sys_finit_module+0xc5/0x130 ? __do_sys_finit_module+0xc5/0x130 __x64_sys_finit_module+0x13/0x20 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f64dda564dd Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 29 0f 00 f7 d8 64 89 01 48 RSP: 002b:00007ffdba393f88 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f64dda564dd RDX: 0000000000000000 RSI: 00005575399e2ab2 RDI: 0000000000000001 RBP: 000055753a91c5e0 R08: 0000000000000000 R09: 0000000000000002 R10: 0000000000000001 R11: 0000000000000246 R12: 00005575399e2ab2 R13: 000055753a91ceb0 R14: 0000000000000000 R15: 000055753a923018 </TASK> Modules linked in: btintel(+) btmtk bluetooth vfat snd_hda_codec_hdmi fat snd_hda_codec_realtek snd_hda_codec_generic iwlmvm(+) snd_sof_pci_intel_tgl mac80211 snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence soundwire_bus snd_sof_intel_hda snd_sof_pci snd_sof snd_sof_xtensa_dsp snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core btrfs snd_compress snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec raid6_pq iwlwifi snd_hda_core snd_pcm snd_timer snd soundcore cfg80211 intel_ish_ipc(+) thunderbolt rfkill intel_ishtp ucsi_acpi wmi i2c_hid_acpi i2c_hid evdev CR2: 000000000000004f ---[ end trace 0000000000000000 ]--- Check the debugfs_dir pointer for an error before using it. Fixes: 8c082a99edb9 ("iwlwifi: mvm: simplify iwl_mvm_dbgfs_register") Signed-off-by: Randy Dunlap <[email protected]> Cc: Luca Coelho <[email protected]> Cc: [email protected] Cc: Kalle Valo <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Emmanuel Grumbach <[email protected]> Cc: stable <[email protected]> Reviewed-by: Greg Kroah-Hartman <[email protected]> Link: https://lore.kernel.org/r/[email protected] [change to make both conditional] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Maciej Fijalkowski <[email protected]> Date: Wed Mar 2 09:59:27 2022 -0800 ixgbe: xsk: change !netif_carrier_ok() handling in ixgbe_xmit_zc() commit 6c7273a266759d9d36f7c862149f248bcdeddc0f upstream. Commit c685c69fba71 ("ixgbe: don't do any AF_XDP zero-copy transmit if netif is not OK") addressed the ring transient state when MEM_TYPE_XSK_BUFF_POOL was being configured which in turn caused the interface to through down/up. Maurice reported that when carrier is not ok and xsk_pool is present on ring pair, ksoftirqd will consume 100% CPU cycles due to the constant NAPI rescheduling as ixgbe_poll() states that there is still some work to be done. To fix this, do not set work_done to false for a !netif_carrier_ok(). Fixes: c685c69fba71 ("ixgbe: don't do any AF_XDP zero-copy transmit if netif is not OK") Reported-by: Maurice Baijens <[email protected]> Tested-by: Maurice Baijens <[email protected]> Signed-off-by: Maciej Fijalkowski <[email protected]> Tested-by: Sandeep Penigalapati <[email protected]> Signed-off-by: Tony Nguyen <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Marc Zyngier <[email protected]> Date: Thu Feb 3 09:24:45 2022 +0000 KVM: arm64: vgic: Read HW interrupt pending state from the HW [ Upstream commit 5bfa685e62e9ba93c303a9a8db646c7228b9b570 ] It appears that a read access to GIC[DR]_I[CS]PENDRn doesn't always result in the pending interrupts being accurately reported if they are mapped to a HW interrupt. This is particularily visible when acking the timer interrupt and reading the GICR_ISPENDR1 register immediately after, for example (the interrupt appears as not-pending while it really is...). This is because a HW interrupt has its 'active and pending state' kept in the *physical* distributor, and not in the virtual one, as mandated by the spec (this is what allows the direct deactivation). The virtual distributor only caries the pending and active *states* (note the plural, as these are two independent and non-overlapping states). Fix it by reading the HW state back, either from the timer itself or from the distributor if necessary. Reported-by: Ricardo Koller <[email protected]> Tested-by: Ricardo Koller <[email protected]> Reviewed-by: Ricardo Koller <[email protected]> Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]>
Author: James Morse <[email protected]> Date: Thu Jan 27 12:20:52 2022 +0000 KVM: arm64: Workaround Cortex-A510's single-step and PAC trap errata [ Upstream commit 1dd498e5e26ad71e3e9130daf72cfb6a693fee03 ] Cortex-A510's erratum #2077057 causes SPSR_EL2 to be corrupted when single-stepping authenticated ERET instructions. A single step is expected, but a pointer authentication trap is taken instead. The erratum causes SPSR_EL1 to be copied to SPSR_EL2, which could allow EL1 to cause a return to EL2 with a guest controlled ELR_EL2. Because the conditions require an ERET into active-not-pending state, this is only a problem for the EL2 when EL2 is stepping EL1. In this case the previous SPSR_EL2 value is preserved in struct kvm_vcpu, and can be restored. Cc: [email protected] # 53960faf2b73: arm64: Add Cortex-A510 CPU part definition Cc: [email protected] Signed-off-by: James Morse <[email protected]> [maz: fixup cpucaps ordering] Signed-off-by: Marc Zyngier <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]>
Author: Like Xu <[email protected]> Date: Tue Mar 1 20:49:41 2022 +0800 KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots() commit c6c937d673aaa1d603f62f134e1ca9c173eeeed3 upstream. Just like on the optional mmu_alloc_direct_roots() path, once shadow path reaches "r = -EIO" somewhere, the caller needs to know the actual state in order to enter error handling and avoid something worse. Fixes: 4a38162ee9f1 ("KVM: MMU: load PDPTRs outside mmu_lock") Signed-off-by: Like Xu <[email protected]> Reviewed-by: Sean Christopherson <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Aaron Lewis <[email protected]> Date: Mon Feb 14 21:29:51 2022 +0000 KVM: x86: Add KVM_CAP_ENABLE_CAP to x86 [ Upstream commit 127770ac0d043435375ab86434f31a93efa88215 ] Follow the precedent set by other architectures that support the VCPU ioctl, KVM_ENABLE_CAP, and advertise the VM extension, KVM_CAP_ENABLE_CAP. This way, userspace can ensure that KVM_ENABLE_CAP is available on a vcpu before using it. Fixes: 5c919412fe61 ("kvm/x86: Hyper-V synthetic interrupt controller") Signed-off-by: Aaron Lewis <[email protected]> Message-Id: <[email protected]> Cc: [email protected] Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Greg Kroah-Hartman <[email protected]> Date: Tue Mar 8 19:14:20 2022 +0100 Linux 5.16.13 Link: https://lore.kernel.org/r/[email protected] Tested-by: Fox Chen <[email protected]> Tested-by: Ronald Warsow <[email protected]> Link: https://lore.kernel.org/r/[email protected] Tested-by: Florian Fainelli <[email protected]> Tested-by: Shuah Khan <[email protected]> Tested-by: Fox Chen <[email protected]> Tested-by: Linux Kernel Functional Testing <[email protected]> Tested-by: Jon Hunter <[email protected]> Tested-by: Rudi Heitbaum <[email protected]> Tested-by: Justin M. Forbes <[email protected]> Tested-by: Bagas Sanjaya <[email protected]> Tested-by: Luna Jernberg <[email protected]> Tested-by: Ron Economos <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Deren Wu <[email protected]> Date: Sun Feb 13 00:20:15 2022 +0800 mac80211: fix EAPoL rekey fail in 802.3 rx path commit 610d086d6df0b15c3732a7b4a5b0f1c3e1b84d4c upstream. mac80211 set capability NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211 to upper layer by default. That means we should pass EAPoL packets through nl80211 path only, and should not send the EAPoL skb to netdevice diretly. At the meanwhile, wpa_supplicant would not register sock to listen EAPoL skb on the netdevice. However, there is no control_port_protocol handler in mac80211 for 802.3 RX packets, mac80211 driver would pass up the EAPoL rekey frame to netdevice and wpa_supplicant would be never interactive with this kind of packets, if SUPPORTS_RX_DECAP_OFFLOAD is enabled. This causes STA always rekey fail if EAPoL frame go through 802.3 path. To avoid this problem, align the same process as 802.11 type to handle this frame before put it into network stack. This also addresses a potential security issue in 802.3 RX mode that was previously fixed in commit a8c4d76a8dd4 ("mac80211: do not accept/forward invalid EAPOL frames"). Cc: [email protected] # 5.12+ Fixes: 80a915ec4427 ("mac80211: add rx decapsulation offload support") Signed-off-by: Deren Wu <[email protected]> Link: https://lore.kernel.org/r/6889c9fced5859ebb088564035f84fd0fa792a49.1644680751.git.deren.wu@mediatek.com [fix typos, update comment and add note about security issue] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Nicolas Escande <[email protected]> Date: Mon Feb 14 18:32:14 2022 +0100 mac80211: fix forwarded mesh frames AC & queue selection commit 859ae7018316daa4adbc496012dcbbb458d7e510 upstream. There are two problems with the current code that have been highlighted with the AQL feature that is now enbaled by default. First problem is in ieee80211_rx_h_mesh_fwding(), ieee80211_select_queue_80211() is used on received packets to choose the sending AC queue of the forwarding packet although this function should only be called on TX packet (it uses ieee80211_tx_info). This ends with forwarded mesh packets been sent on unrelated random AC queue. To fix that, AC queue can directly be infered from skb->priority which has been extracted from QOS info (see ieee80211_parse_qos()). Second problem is the value of queue_mapping set on forwarded mesh frames via skb_set_queue_mapping() is not the AC of the packet but a hardware queue index. This may or may not work depending on AC to HW queue mapping which is driver specific. Both of these issues lead to improper AC selection while forwarding mesh packets but more importantly due to improper airtime accounting (which is done on a per STA, per AC basis) caused traffic stall with the introduction of AQL. Fixes: cf44012810cc ("mac80211: fix unnecessary frame drops in mesh fwding") Fixes: d3c1597b8d1b ("mac80211: fix forwarded mesh frame queue mapping") Co-developed-by: Remi Pommarel <[email protected]> Signed-off-by: Remi Pommarel <[email protected]> Signed-off-by: Nicolas Escande <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Johannes Berg <[email protected]> Date: Thu Feb 24 10:39:34 2022 +0100 mac80211: treat some SAE auth steps as final commit 94d9864cc86f572f881db9b842a78e9d075493ae upstream. When we get anti-clogging token required (added by the commit mentioned below), or the other status codes added by the later commit 4e56cde15f7d ("mac80211: Handle special status codes in SAE commit") we currently just pretend (towards the internal state machine of authentication) that we didn't receive anything. This has the undesirable consequence of retransmitting the prior frame, which is not expected, because the timer is still armed. If we just disarm the timer at that point, it would result in the undesirable side effect of being in this state indefinitely if userspace crashes, or so. So to fix this, reset the timer and set a new auth_data->waiting in order to have no more retransmissions, but to have the data destroyed when the timer actually fires, which will only happen if userspace didn't continue (i.e. crashed or abandoned it.) Fixes: a4055e74a2ff ("mac80211: Don't destroy auth data in case of anti-clogging") Reported-by: Jouni Malinen <[email protected]> Link: https://lore.kernel.org/r/20220224103932.75964e1d7932.Ia487f91556f29daae734bf61f8181404642e1eec@changeid Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: JaeMan Park <[email protected]> Date: Thu Jan 13 15:02:35 2022 +0900 mac80211_hwsim: initialize ieee80211_tx_info at hw_scan_work [ Upstream commit cacfddf82baf1470e5741edeecb187260868f195 ] In mac80211_hwsim, the probe_req frame is created and sent while scanning. It is sent with ieee80211_tx_info which is not initialized. Uninitialized ieee80211_tx_info can cause problems when using mac80211_hwsim with wmediumd. wmediumd checks the tx_rates field of ieee80211_tx_info and doesn't relay probe_req frame to other clients even if it is a broadcasting message. Call ieee80211_tx_prepare_skb() to initialize ieee80211_tx_info for the probe_req that is created by hw_scan_work in mac80211_hwsim. Signed-off-by: JaeMan Park <[email protected]> Link: https://lore.kernel.org/r/[email protected] [fix memory leak] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Benjamin Beichler <[email protected]> Date: Tue Jan 11 22:13:26 2022 +0000 mac80211_hwsim: report NOACK frames in tx_status [ Upstream commit 42a79960ffa50bfe9e0bf5d6280be89bf563a5dd ] Add IEEE80211_TX_STAT_NOACK_TRANSMITTED to tx_status flags to have proper statistics for non-acked frames. Signed-off-by: Benjamin Beichler <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Hugh Dickins <[email protected]> Date: Fri Mar 4 20:29:01 2022 -0800 memfd: fix F_SEAL_WRITE after shmem huge page allocated commit f2b277c4d1c63a85127e8aa2588e9cc3bd21cb99 upstream. Wangyong reports: after enabling tmpfs filesystem to support transparent hugepage with the following command: echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled the docker program tries to add F_SEAL_WRITE through the following command, but it fails unexpectedly with errno EBUSY: fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1. That is because memfd_tag_pins() and memfd_wait_for_pins() were never updated for shmem huge pages: checking page_mapcount() against page_count() is hopeless on THP subpages - they need to check total_mapcount() against page_count() on THP heads only. Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins() (compared != 1): either can be justified, but given the non-atomic total_mapcount() calculation, it is better now to be strict. Bear in mind that total_mapcount() itself scans all of the THP subpages, when choosing to take an XA_CHECK_SCHED latency break. Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a page has been swapped out since memfd_tag_pins(), then its refcount must have fallen, and so it can safely be untagged. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Hugh Dickins <[email protected]> Reported-by: Zeal Robot <[email protected]> Reported-by: wangyong <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: CGEL ZTE <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Song Liu <[email protected]> Cc: Yang Yang <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Chuanhong Guo <[email protected]> Date: Fri Feb 11 08:13:44 2022 +0800 MIPS: ralink: mt7621: do memory detection on KSEG1 [ Upstream commit cc19db8b312a6c75645645f5cc1b45166b109006 ] It's reported that current memory detection code occasionally detects larger memory under some bootloaders. Current memory detection code tests whether address space wraps around on KSEG0, which is unreliable because it's cached. Rewrite memory size detection to perform the same test on KSEG1 instead. While at it, this patch also does the following two things: 1. use a fixed pattern instead of a random function pointer as the magic value. 2. add an additional memory write and a second comparison as part of the test to prevent possible smaller memory detection result due to leftover values in memory. Fixes: 139c949f7f0a MIPS: ("ralink: mt7621: add memory detection support") Reported-by: Rui Salvaterra <[email protected]> Signed-off-by: Chuanhong Guo <[email protected]> Tested-by: Sergio Paracuellos <[email protected]> Tested-by: Rui Salvaterra <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ilya Lipnitskiy <[email protected]> Date: Mon Feb 28 17:15:07 2022 -0800 MIPS: ralink: mt7621: use bitwise NOT instead of logical [ Upstream commit 5d8965704fe5662e2e4a7e4424a2cbe53e182670 ] It was the intention to reverse the bits, not make them all zero by using logical NOT operator. Fixes: cc19db8b312a ("MIPS: ralink: mt7621: do memory detection on KSEG1") Suggested-by: Chuanhong Guo <[email protected]> Signed-off-by: Ilya Lipnitskiy <[email protected]> Reviewed-by: Sergio Paracuellos <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Randy Dunlap <[email protected]> Date: Mon Feb 21 09:50:29 2022 -0800 mips: setup: fix setnocoherentio() boolean setting commit 1e6ae0e46e32749b130f1823da30cea9aa2a59a0 upstream. Correct a typo/pasto: setnocoherentio() should set dma_default_coherent to false, not true. Fixes: 14ac09a65e19 ("MIPS: refactor the runtime coherent vs noncoherent DMA indicators") Signed-off-by: Randy Dunlap <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Thomas Bogendoerfer <[email protected]> Cc: [email protected] Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Thomas Bogendoerfer <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Daniel Borkmann <[email protected]> Date: Fri Mar 4 15:26:32 2022 +0100 mm: Consider __GFP_NOWARN flag for oversized kvmalloc() calls commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168 upstream. syzkaller was recently triggering an oversized kvmalloc() warning via xdp_umem_create(). The triggered warning was added back in 7661809d493b ("mm: don't allow oversized kvmalloc() calls"). The rationale for the warning for huge kvmalloc sizes was as a reaction to a security bug where the size was more than UINT_MAX but not everything was prepared to handle unsigned long sizes. Anyway, the AF_XDP related call trace from this syzkaller report was: kvmalloc include/linux/mm.h:806 [inline] kvmalloc_array include/linux/mm.h:824 [inline] kvcalloc include/linux/mm.h:829 [inline] xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline] xdp_umem_reg net/xdp/xdp_umem.c:219 [inline] xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252 xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068 __sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176 __do_sys_setsockopt net/socket.c:2187 [inline] __se_sys_setsockopt net/socket.c:2184 [inline] __x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae Björn mentioned that requests for >2GB allocation can still be valid: The structure that is being allocated is the page-pinning accounting. AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but still fewer than what memcg allows (PAGE_COUNTER_MAX is a LONG_MAX/ PAGE_SIZE on 64 bit systems). [...] I could just change from U32_MAX to INT_MAX, but as I stated earlier that has a hacky feeling to it. [...] From my perspective, the code isn't broken, with the memcg limits in consideration. [...] Linus says: [...] Pretty much every time this has come up, the kernel warning has shown that yes, the code was broken and there really wasn't a reason for doing allocations that big. Of course, some people would be perfectly fine with the allocation failing, they just don't want the warning. I didn't want __GFP_NOWARN to shut it up originally because I wanted people to see all those cases, but these days I think we can just say "yeah, people can shut it up explicitly by saying 'go ahead and fail this allocation, don't warn about it'". So enough time has passed that by now I'd certainly be ok with [it]. Thus allow call-sites to silence such userspace triggered splats if the allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call to kvcalloc() this is already the case, so nothing else needed there. Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls") Reported-by: [email protected] Suggested-by: Linus Torvalds <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Tested-by: [email protected] Cc: Björn Töpel <[email protected]> Cc: Magnus Karlsson <[email protected]> Cc: Willy Tarreau <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Alexei Starovoitov <[email protected]> Cc: Andrii Nakryiko <[email protected]> Cc: Jakub Kicinski <[email protected]> Cc: David S. Miller <[email protected]> Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com Link: https://lore.kernel.org/bpf/[email protected] Reviewed-by: Leon Romanovsky <[email protected]> Ackd-by: Michal Hocko <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Mat Martineau <[email protected]> Date: Thu Feb 24 16:52:59 2022 -0800 mptcp: Correctly set DATA_FIN timeout when number of retransmits is large commit 877d11f0332cd2160e19e3313e262754c321fa36 upstream. Syzkaller with UBSAN uncovered a scenario where a large number of DATA_FIN retransmits caused a shift-out-of-bounds in the DATA_FIN timeout calculation: ================================================================================ UBSAN: shift-out-of-bounds in net/mptcp/protocol.c:470:29 shift exponent 32 is too large for 32-bit type 'unsigned int' CPU: 1 PID: 13059 Comm: kworker/1:0 Not tainted 5.17.0-rc2-00630-g5fbf21c90c60 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 ubsan_epilogue+0xb/0x5a lib/ubsan.c:151 __ubsan_handle_shift_out_of_bounds.cold+0xb2/0x20e lib/ubsan.c:330 mptcp_set_datafin_timeout net/mptcp/protocol.c:470 [inline] __mptcp_retrans.cold+0x72/0x77 net/mptcp/protocol.c:2445 mptcp_worker+0x58a/0xa70 net/mptcp/protocol.c:2528 process_one_work+0x9df/0x16d0 kernel/workqueue.c:2307 worker_thread+0x95/0xe10 kernel/workqueue.c:2454 kthread+0x2f4/0x3b0 kernel/kthread.c:377 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295 </TASK> ================================================================================ This change limits the maximum timeout by limiting the size of the shift, which keeps all intermediate values in-bounds. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/259 Fixes: 6477dd39e62c ("mptcp: Retransmit DATA_FIN") Acked-by: Paolo Abeni <[email protected]> Signed-off-by: Mat Martineau <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: D. Wythe <[email protected]> Date: Thu Feb 24 23:26:19 2022 +0800 net/smc: fix connection leak commit 9f1c50cf39167ff71dc5953a3234f3f6eeb8fcb5 upstream. There's a potential leak issue under following execution sequence : smc_release smc_connect_work if (sk->sk_state == SMC_INIT) send_clc_confirim tcp_abort(); ... sk.sk_state = SMC_ACTIVE smc_close_active switch(sk->sk_state) { ... case SMC_ACTIVE: smc_close_final() // then wait peer closed Unfortunately, tcp_abort() may discard CLC CONFIRM messages that are still in the tcp send buffer, in which case our connection token cannot be delivered to the server side, which means that we cannot get a passive close message at all. Therefore, it is impossible for the to be disconnected at all. This patch tries a very simple way to avoid this issue, once the state has changed to SMC_ACTIVE after tcp_abort(), we can actively abort the smc connection, considering that the state is SMC_INIT before tcp_abort(), abandoning the complete disconnection process should not cause too much problem. In fact, this problem may exist as long as the CLC CONFIRM message is not received by the server. Whether a timer should be added after smc_close_final() needs to be discussed in the future. But even so, this patch provides a faster release for connection in above case, it should also be valuable. Fixes: 39f41f367b08 ("net/smc: common release code for non-accepted sockets") Signed-off-by: D. Wythe <[email protected]> Acked-by: Karsten Graul <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: D. Wythe <[email protected]> Date: Wed Mar 2 21:25:12 2022 +0800 net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error cause by server commit 4940a1fdf31c39f0806ac831cde333134862030b upstream. The problem of SMC_CLC_DECL_ERR_REGRMB on the server is very clear. Based on the fact that whether a new SMC connection can be accepted or not depends on not only the limit of conn nums, but also the available entries of rtoken. Since the rtoken release is trigger by peer, while the conn nums is decrease by local, tons of thing can happen in this time difference. This only thing that needs to be mentioned is that now all connection creations are completely protected by smc_server_lgr_pending lock, it's enough to check only the available entries in rtokens_used_mask. Fixes: cd6851f30386 ("smc: remote memory buffers (RMBs)") Signed-off-by: D. Wythe <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: D. Wythe <[email protected]> Date: Wed Mar 2 21:25:11 2022 +0800 net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error generated by client commit 0537f0a2151375dcf90c1bbfda6a0aaf57164e89 upstream. The main reason for this unexpected SMC_CLC_DECL_ERR_REGRMB in client dues to following execution sequence: Server Conn A: Server Conn B: Client Conn B: smc_lgr_unregister_conn smc_lgr_register_conn smc_clc_send_accept -> smc_rtoken_add smcr_buf_unuse -> Client Conn A: smc_rtoken_delete smc_lgr_unregister_conn() makes current link available to assigned to new incoming connection, while smcr_buf_unuse() has not executed yet, which means that smc_rtoken_add may fail because of insufficient rtoken_entry, reversing their execution order will avoid this problem. Fixes: 3e034725c0d8 ("net/smc: common functions for RMBs and send buffers") Signed-off-by: D. Wythe <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Zheyu Ma <[email protected]> Date: Wed Mar 2 20:24:23 2022 +0800 net: arcnet: com20020: Fix null-ptr-deref in com20020pci_probe() commit bd6f1fd5d33dfe5d1b4f2502d3694a7cc13f166d upstream. During driver initialization, the pointer of card info, i.e. the variable 'ci' is required. However, the definition of 'com20020pci_id_table' reveals that this field is empty for some devices, which will cause null pointer dereference when initializing these devices. The following log reveals it: [ 3.973806] KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [ 3.973819] RIP: 0010:com20020pci_probe+0x18d/0x13e0 [com20020_pci] [ 3.975181] Call Trace: [ 3.976208] local_pci_probe+0x13f/0x210 [ 3.977248] pci_device_probe+0x34c/0x6d0 [ 3.977255] ? pci_uevent+0x470/0x470 [ 3.978265] really_probe+0x24c/0x8d0 [ 3.978273] __driver_probe_device+0x1b3/0x280 [ 3.979288] driver_probe_device+0x50/0x370 Fix this by checking whether the 'ci' is a null pointer first. Fixes: 8c14f9c70327 ("ARCNET: add com20020 PCI IDs with metadata") Signed-off-by: Zheyu Ma <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Jia-Ju Bai <[email protected]> Date: Fri Feb 25 04:37:27 2022 -0800 net: chelsio: cxgb3: check the return value of pci_find_capability() [ Upstream commit 767b9825ed1765894e569a3d698749d40d83762a ] The function pci_find_capability() in t3_prep_adapter() can fail, so its return value should be checked. Fixes: 4d22de3e6cc4 ("Add support for the latest 1G/10G Chelsio adapter, T3") Reported-by: TOTE Robot <[email protected]> Signed-off-by: Jia-Ju Bai <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Vladimir Oltean <[email protected]> Date: Wed Mar 2 21:39:39 2022 +0200 net: dcb: disable softirqs in dcbnl_flush_dev() [ Upstream commit 10b6bb62ae1a49ee818fc479cf57b8900176773e ] Ido Schimmel points out that since commit 52cff74eef5d ("dcbnl : Disable software interrupts before taking dcb_lock"), the DCB API can be called by drivers from softirq context. One such in-tree example is the chelsio cxgb4 driver: dcb_rpl -> cxgb4_dcb_handle_fw_update -> dcb_ieee_setapp If the firmware for this driver happened to send an event which resulted in a call to dcb_ieee_setapp() at the exact same time as another DCB-enabled interface was unregistering on the same CPU, the softirq would deadlock, because the interrupted process was already holding the dcb_lock in dcbnl_flush_dev(). Fix this unlikely event by using spin_lock_bh() in dcbnl_flush_dev() as in the rest of the dcbnl code. Fixes: 91b0383fef06 ("net: dcb: flush lingering app table entries for unregistered devices") Reported-by: Ido Schimmel <[email protected]> Signed-off-by: Vladimir Oltean <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Vladimir Oltean <[email protected]> Date: Thu Feb 24 18:01:54 2022 +0200 net: dcb: flush lingering app table entries for unregistered devices commit 91b0383fef06f20b847fa9e4f0e3054ead0b1a1b upstream. If I'm not mistaken (and I don't think I am), the way in which the dcbnl_ops work is that drivers call dcb_ieee_setapp() and this populates the application table with dynamically allocated struct dcb_app_type entries that are kept in the module-global dcb_app_list. However, nobody keeps exact track of these entries, and although dcb_ieee_delapp() is supposed to remove them, nobody does so when the interface goes away (example: driver unbinds from device). So the dcb_app_list will contain lingering entries with an ifindex that no longer matches any device in dcb_app_lookup(). Reclaim the lost memory by listening for the NETDEV_UNREGISTER event and flushing the app table entries of interfaces that are now gone. In fact something like this used to be done as part of the initial commit (blamed below), but it was done in dcbnl_exit() -> dcb_flushapp(), essentially at module_exit time. That became dead code after commit 7a6b6f515f77 ("DCB: fix kconfig option") which essentially merged "tristate config DCB" and "bool config DCBNL" into a single "bool config DCB", so net/dcb/dcbnl.c could not be built as a module anymore. Commit 36b9ad8084bd ("net/dcb: make dcbnl.c explicitly non-modular") recognized this and deleted dcbnl_exit() and dcb_flushapp() altogether, leaving us with the version we have today. Since flushing application table entries can and should be done as soon as the netdevice disappears, fundamentally the commit that is to blame is the one that introduced the design of this API. Fixes: 9ab933ab2cc8 ("dcbnl: add appliction tlv handlers") Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Svenning Sørensen <[email protected]> Date: Fri Feb 18 11:27:01 2022 +0000 net: dsa: microchip: fix bridging with more than two member ports commit 3d00827a90db6f79abc7cdc553887f89a2e0a184 upstream. Commit b3612ccdf284 ("net: dsa: microchip: implement multi-bridge support") plugged a packet leak between ports that were members of different bridges. Unfortunately, this broke another use case, namely that of more than two ports that are members of the same bridge. After that commit, when a port is added to a bridge, hardware bridging between other member ports of that bridge will be cleared, preventing packet exchange between them. Fix by ensuring that the Port VLAN Membership bitmap includes any existing ports in the bridge, not just the port being added. Fixes: b3612ccdf284 ("net: dsa: microchip: implement multi-bridge support") Signed-off-by: Svenning Sørensen <[email protected]> Tested-by: Oleksij Rempel <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: lena wang <[email protected]> Date: Tue Mar 1 19:17:09 2022 +0800 net: fix up skbs delta_truesize in UDP GRO frag_list commit 224102de2ff105a2c05695e66a08f4b5b6b2d19c upstream. The truesize for a UDP GRO packet is added by main skb and skbs in main skb's frag_list: skb_gro_receive_list p->truesize += skb->truesize; The commit 53475c5dd856 ("net: fix use-after-free when UDP GRO with shared fraglist") introduced a truesize increase for frag_list skbs. When uncloning skb, it will call pskb_expand_head and trusesize for frag_list skbs may increase. This can occur when allocators uses __netdev_alloc_skb and not jump into __alloc_skb. This flow does not use ksize(len) to calculate truesize while pskb_expand_head uses. skb_segment_list err = skb_unclone(nskb, GFP_ATOMIC); pskb_expand_head if (!skb->sk || skb->destructor == sock_edemux) skb->truesize += size - osize; If we uses increased truesize adding as delta_truesize, it will be larger than before and even larger than previous total truesize value if skbs in frag_list are abundant. The main skb truesize will become smaller and even a minus value or a huge value for an unsigned int parameter. Then the following memory check will drop this abnormal skb. To avoid this error we should use the original truesize to segment the main skb. Fixes: 53475c5dd856 ("net: fix use-after-free when UDP GRO with shared fraglist") Signed-off-by: lena wang <[email protected]> Acked-by: Paolo Abeni <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Alex Elder <[email protected]> Date: Tue Mar 1 05:34:40 2022 -0600 net: ipa: add an interconnect dependency commit 1dba41c9d2e2dc94b543394974f63d55aa195bfe upstream. In order to function, the IPA driver very clearly requires the interconnect framework to be enabled in the kernel configuration. State that dependency in the Kconfig file. This became a problem when CONFIG_COMPILE_TEST support was added. Non-Qualcomm platforms won't necessarily enable CONFIG_INTERCONNECT. Reported-by: kernel test robot <[email protected]> Fixes: 38a4066f593c5 ("net: ipa: support COMPILE_TEST") Signed-off-by: Alex Elder <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Alex Elder <[email protected]> Date: Fri Feb 25 14:15:30 2022 -0600 net: ipa: fix a build dependency commit caef14b7530c065fb85d54492768fa48fdb5093e upstream. An IPA build problem arose in the linux-next tree the other day. The problem is that a recent commit adds a new dependency on some code, and the Kconfig file for IPA doesn't reflect that dependency. As a result, some configurations can fail to build (particularly when COMPILE_TEST is enabled). The recent patch adds calls to qmp_get(), qmp_put(), and qmp_send(), and those are built based on the QCOM_AOSS_QMP config option. If that symbol is not defined, stubs are defined, so we just need to ensure QCOM_AOSS_QMP is compatible with QCOM_IPA, or it's not defined. Reported-by: Randy Dunlap <[email protected]> Fixes: 34a081761e4e3 ("net: ipa: request IPA register values be retained") Signed-off-by: Alex Elder <[email protected]> Tested-by: Randy Dunlap <[email protected]> Acked-by: Randy Dunlap <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: [email protected] <[email protected]> Date: Thu Feb 24 10:06:49 2022 +0100 net: ipv6: ensure we call ipv6_mc_down() at most once commit 9995b408f17ff8c7f11bc725c8aa225ba3a63b1c upstream. There are two reasons for addrconf_notify() to be called with NETDEV_DOWN: either the network device is actually going down, or IPv6 was disabled on the interface. If either of them stays down while the other is toggled, we repeatedly call the code for NETDEV_DOWN, including ipv6_mc_down(), while never calling the corresponding ipv6_mc_up() in between. This will cause a new entry in idev->mc_tomb to be allocated for each multicast group the interface is subscribed to, which in turn leaks one struct ifmcaddr6 per nontrivial multicast group the interface is subscribed to. The following reproducer will leak at least $n objects: ip addr add ff2e::4242/32 dev eth0 autojoin sysctl -w net.ipv6.conf.eth0.disable_ipv6=1 for i in $(seq 1 $n); do ip link set up eth0; ip link set down eth0 done Joining groups with IPV6_ADD_MEMBERSHIP (unprivileged) or setting the sysctl net.ipv6.conf.eth0.forwarding to 1 (=> subscribing to ff02::2) can also be used to create a nontrivial idev->mc_list, which will the leak objects with the right up-down-sequence. Based on both sources for NETDEV_DOWN events the interface IPv6 state should be considered: - not ready if the network interface is not ready OR IPv6 is disabled for it - ready if the network interface is ready AND IPv6 is enabled for it The functions ipv6_mc_up() and ipv6_down() should only be run when this state changes. Implement this by remembering when the IPv6 state is ready, and only run ipv6_mc_down() if it actually changed from ready to not ready. The other direction (not ready -> ready) already works correctly, as: - the interface notification triggered codepath for NETDEV_UP / NETDEV_CHANGE returns early if ipv6 is disabled, and - the disable_ipv6=0 triggered codepath skips fully initializing the interface as long as addrconf_link_ready(dev) returns false - calling ipv6_mc_up() repeatedly does not leak anything Fixes: 3ce62a84d53c ("ipv6: exit early in addrconf_notify() if IPv6 is disabled") Signed-off-by: Johannes Nixdorf <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Casper Andersson <[email protected]> Date: Fri Feb 25 11:15:16 2022 +0100 net: sparx5: Fix add vlan when invalid operation [ Upstream commit b3a34dc362c03215031b268fcc0b988e69490231 ] Check if operation is valid before changing any settings in hardware. Otherwise it results in changes being made despite it not being a valid operation. Fixes: 78eab33bb68b ("net: sparx5: add vlan support") Signed-off-by: Casper Andersson <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ong Boon Leong <[email protected]> Date: Thu Nov 11 22:39:49 2021 +0800 net: stmmac: enhance XDP ZC driver level switching performance [ Upstream commit ac746c8520d9d056b6963ecca8ff1da9929d02f1 ] The previous stmmac_xdp_set_prog() implementation uses stmmac_release() and stmmac_open() which tear down the PHY device and causes undesirable autonegotiation which causes a delay whenever AFXDP ZC is setup. This patch introduces two new functions that just sufficiently tear down DMA descriptors, buffer, NAPI process, and IRQs and reestablish them accordingly in both stmmac_xdp_release() and stammac_xdp_open(). As the results of this enhancement, we get rid of transient state introduced by the link auto-negotiation: $ ./xdpsock -i eth0 -t -z sock0@eth0:0 txonly xdp-drv pps pkts 1.00 rx 0 0 tx 634444 634560 sock0@eth0:0 txonly xdp-drv pps pkts 1.00 rx 0 0 tx 632330 1267072 sock0@eth0:0 txonly xdp-drv pps pkts 1.00 rx 0 0 tx 632438 1899584 sock0@eth0:0 txonly xdp-drv pps pkts 1.00 rx 0 0 tx 632502 2532160 Reported-by: Kurt Kanzenbach <[email protected]> Signed-off-by: Ong Boon Leong <[email protected]> Tested-by: Kurt Kanzenbach <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Randy Dunlap <[email protected]> Date: Wed Feb 23 19:35:36 2022 -0800 net: stmmac: fix return value of __setup handler commit e01b042e580f1fbf4fd8da467442451da00c7a90 upstream. __setup() handlers should return 1 on success, i.e., the parameter has been handled. A return of 0 causes the "option=value" string to be added to init's environment strings, polluting it. Fixes: 47dd7a540b8a ("net: add support for STMicroelectronics Ethernet controllers.") Fixes: f3240e2811f0 ("stmmac: remove warning when compile as built-in (V2)") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: Igor Zhbanov <[email protected]> Link: lore.kernel.org/r/[email protected] Cc: Giuseppe Cavallaro <[email protected]> Cc: Alexandre Torgue <[email protected]> Cc: Jose Abreu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Vincent Whitchurch <[email protected]> Date: Thu Feb 24 12:38:29 2022 +0100 net: stmmac: only enable DMA interrupts when ready [ Upstream commit 087a7b944c5db409f7c1a68bf4896c56ba54eaff ] In this driver's ->ndo_open() callback, it enables DMA interrupts, starts the DMA channels, then requests interrupts with request_irq(), and then finally enables napi. If RX DMA interrupts are received before napi is enabled, no processing is done because napi_schedule_prep() will return false. If the network has a lot of broadcast/multicast traffic, then the RX ring could fill up completely before napi is enabled. When this happens, no further RX interrupts will be delivered, and the driver will fail to receive any packets. Fix this by only enabling DMA interrupts after all other initialization is complete. Fixes: 523f11b5d4fd72efb ("net: stmmac: move hardware setup for stmmac_open to new function") Reported-by: Lars Persson <[email protected]> Signed-off-by: Vincent Whitchurch <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Ong Boon Leong <[email protected]> Date: Wed Nov 24 19:40:19 2021 +0800 net: stmmac: perserve TX and RX coalesce value during XDP setup commit 61da6ac715700bcfeef50d187e15c6cc7c9d079b upstream. When XDP program is loaded, it is desirable that the previous TX and RX coalesce values are not re-inited to its default value. This prevents unnecessary re-configurig the coalesce values that were working fine before. Fixes: ac746c8520d9 ("net: stmmac: enhance XDP ZC driver level switching performance") Signed-off-by: Ong Boon Leong <[email protected]> Tested-by: Kurt Kanzenbach <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Randy Dunlap <[email protected]> Date: Wed Feb 23 19:35:28 2022 -0800 net: sxgbe: fix return value of __setup handler commit 50e06ddceeea263f57fe92baa677c638ecd65bb6 upstream. __setup() handlers should return 1 on success, i.e., the parameter has been handled. A return of 0 causes the "option=value" string to be added to init's environment strings, polluting it. Fixes: acc18c147b22 ("net: sxgbe: add EEE(Energy Efficient Ethernet) for Samsung sxgbe") Fixes: 1edb9ca69e8a ("net: sxgbe: add basic framework for Samsung 10Gb ethernet driver") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: Igor Zhbanov <[email protected]> Link: lore.kernel.org/r/[email protected] Cc: Siva Reddy <[email protected]> Cc: Girish K S <[email protected]> Cc: Byungho An <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Daniele Palmas <[email protected]> Date: Tue Feb 15 12:13:35 2022 +0100 net: usb: cdc_mbim: avoid altsetting toggling for Telit FN990 [ Upstream commit 21e8a96377e6b6debae42164605bf9dcbe5720c5 ] Add quirk CDC_MBIM_FLAG_AVOID_ALTSETTING_TOGGLE for Telit FN990 0x1071 composition in order to avoid bind error. Signed-off-by: Daniele Palmas <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Eric Dumazet <[email protected]> Date: Sun Feb 27 10:01:41 2022 -0800 netfilter: fix use-after-free in __nf_register_net_hook() commit 56763f12b0f02706576a088e85ef856deacc98a0 upstream. We must not dereference @new_hooks after nf_hook_mutex has been released, because other threads might have freed our allocated hooks already. BUG: KASAN: use-after-free in nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline] BUG: KASAN: use-after-free in hooks_validate net/netfilter/core.c:171 [inline] BUG: KASAN: use-after-free in __nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438 Read of size 2 at addr ffff88801c1a8000 by task syz-executor237/4430 CPU: 1 PID: 4430 Comm: syz-executor237 Not tainted 5.17.0-rc5-syzkaller-00306-g2293be58d6a1 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold+0x83/0xdf mm/kasan/report.c:459 nf_hook_entries_get_hook_ops include/linux/netfilter.h:130 [inline] hooks_validate net/netfilter/core.c:171 [inline] __nf_register_net_hook+0x77a/0x820 net/netfilter/core.c:438 nf_register_net_hook+0x114/0x170 net/netfilter/core.c:571 nf_register_net_hooks+0x59/0xc0 net/netfilter/core.c:587 nf_synproxy_ipv6_init+0x85/0xe0 net/netfilter/nf_synproxy_core.c:1218 synproxy_tg6_check+0x30d/0x560 net/ipv6/netfilter/ip6t_SYNPROXY.c:81 xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1038 check_target net/ipv6/netfilter/ip6_tables.c:530 [inline] find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:573 translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:735 do_replace net/ipv6/netfilter/ip6_tables.c:1153 [inline] do_ip6t_set_ctl+0x56e/0xb90 net/ipv6/netfilter/ip6_tables.c:1639 nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101 ipv6_setsockopt+0x122/0x180 net/ipv6/ipv6_sockglue.c:1024 rawv6_setsockopt+0xd3/0x6a0 net/ipv6/raw.c:1084 __sys_setsockopt+0x2db/0x610 net/socket.c:2180 __do_sys_setsockopt net/socket.c:2191 [inline] __se_sys_setsockopt net/socket.c:2188 [inline] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2188 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f65a1ace7d9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 71 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f65a1a7f308 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f65a1ace7d9 RDX: 0000000000000040 RSI: 0000000000000029 RDI: 0000000000000003 RBP: 00007f65a1b574c8 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000020000000 R11: 0000000000000246 R12: 00007f65a1b55130 R13: 00007f65a1b574c0 R14: 00007f65a1b24090 R15: 0000000000022000 </TASK> The buggy address belongs to the page: page:ffffea0000706a00 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1c1a8 flags: 0xfff00000000000(node=0|zone=1|lastcpupid=0x7ff) raw: 00fff00000000000 ffffea0001c1b108 ffffea000046dd08 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as freed page last allocated via order 2, migratetype Unmovable, gfp_mask 0x52dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_ZERO), pid 4430, ts 1061781545818, free_ts 1061791488993 prep_new_page mm/page_alloc.c:2434 [inline] get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389 __alloc_pages_node include/linux/gfp.h:572 [inline] alloc_pages_node include/linux/gfp.h:595 [inline] kmalloc_large_node+0x62/0x130 mm/slub.c:4438 __kmalloc_node+0x35a/0x4a0 mm/slub.c:4454 kmalloc_node include/linux/slab.h:604 [inline] kvmalloc_node+0x97/0x100 mm/util.c:580 kvmalloc include/linux/slab.h:731 [inline] kvzalloc include/linux/slab.h:739 [inline] allocate_hook_entries_size net/netfilter/core.c:61 [inline] nf_hook_entries_grow+0x140/0x780 net/netfilter/core.c:128 __nf_register_net_hook+0x144/0x820 net/netfilter/core.c:429 nf_register_net_hook+0x114/0x170 net/netfilter/core.c:571 nf_register_net_hooks+0x59/0xc0 net/netfilter/core.c:587 nf_synproxy_ipv6_init+0x85/0xe0 net/netfilter/nf_synproxy_core.c:1218 synproxy_tg6_check+0x30d/0x560 net/ipv6/netfilter/ip6t_SYNPROXY.c:81 xt_check_target+0x26c/0x9e0 net/netfilter/x_tables.c:1038 check_target net/ipv6/netfilter/ip6_tables.c:530 [inline] find_check_entry.constprop.0+0x7f1/0x9e0 net/ipv6/netfilter/ip6_tables.c:573 translate_table+0xc8b/0x1750 net/ipv6/netfilter/ip6_tables.c:735 do_replace net/ipv6/netfilter/ip6_tables.c:1153 [inline] do_ip6t_set_ctl+0x56e/0xb90 net/ipv6/netfilter/ip6_tables.c:1639 nf_setsockopt+0x83/0xe0 net/netfilter/nf_sockopt.c:101 page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1352 [inline] free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404 free_unref_page_prepare mm/page_alloc.c:3325 [inline] free_unref_page+0x19/0x690 mm/page_alloc.c:3404 kvfree+0x42/0x50 mm/util.c:613 rcu_do_batch kernel/rcu/tree.c:2527 [inline] rcu_core+0x7b1/0x1820 kernel/rcu/tree.c:2778 __do_softirq+0x29b/0x9c2 kernel/softirq.c:558 Memory state around the buggy address: ffff88801c1a7f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88801c1a7f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff >ffff88801c1a8000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff88801c1a8080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff88801c1a8100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff Fixes: 2420b79f8c18 ("netfilter: debug: check for sorted array") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: syzbot <[email protected]> Acked-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Florian Westphal <[email protected]> Date: Fri Feb 25 14:02:41 2022 +0100 netfilter: nf_queue: don't assume sk is full socket commit 747670fd9a2d1b7774030dba65ca022ba442ce71 upstream. There is no guarantee that state->sk refers to a full socket. If refcount transitions to 0, sock_put calls sk_free which then ends up with garbage fields. I'd like to thank Oleksandr Natalenko and Jiri Benc for considerable debug work and pointing out state->sk oddities. Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Tested-by: Oleksandr Natalenko <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Florian Westphal <[email protected]> Date: Mon Feb 28 06:22:22 2022 +0100 netfilter: nf_queue: fix possible use-after-free commit c3873070247d9e3c7a6b0cf9bf9b45e8018427b1 upstream. Eric Dumazet says: The sock_hold() side seems suspect, because there is no guarantee that sk_refcnt is not already 0. On failure, we cannot queue the packet and need to indicate an error. The packet will be dropped by the caller. v2: split skb prefetch hunk into separate change Fixes: 271b72c7fa82c ("udp: RCU handling for Unicast packets.") Reported-by: Eric Dumazet <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Florian Westphal <[email protected]> Date: Tue Mar 1 00:46:19 2022 +0100 netfilter: nf_queue: handle socket prefetch commit 3b836da4081fa585cf6c392f62557496f2cb0efe upstream. In case someone combines bpf socket assign and nf_queue, then we will queue an skb who references a struct sock that did not have its reference count incremented. As we leave rcu protection, there is no guarantee that skb->sk is still valid. For refcount-less skb->sk case, try to increment the reference count and then override the destructor. In case of failure we have two choices: orphan the skb and 'delete' preselect or let nf_queue() drop the packet. Do the latter, it should not happen during normal operation. Fixes: cf7fbe660f2d ("bpf: Add socket assign support") Acked-by: Joe Stringer <[email protected]> Signed-off-by: Florian Westphal <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Eric Dumazet <[email protected]> Date: Tue Feb 22 10:13:31 2022 -0800 netfilter: nf_tables: prefer kfree_rcu(ptr, rcu) variant [ Upstream commit ae089831ff28a115908b8d796f667c2dadef1637 ] While kfree_rcu(ptr) _is_ supported, it has some limitations. Given that 99.99% of kfree_rcu() users [1] use the legacy two parameters variant, and @catchall objects do have an rcu head, simply use it. Choice of kfree_rcu(ptr) variant was probably not intentional. [1] including calls from net/netfilter/nf_tables_api.c Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Signed-off-by: Eric Dumazet <[email protected]> Reviewed-by: Florian Westphal <[email protected]> Signed-off-by: Pablo Neira Ayuso <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Jiasheng Jiang <[email protected]> Date: Tue Mar 1 18:00:20 2022 +0800 nl80211: Handle nla_memdup failures in handle_nan_filter [ Upstream commit 6ad27f522cb3b210476daf63ce6ddb6568c0508b ] As there's potential for failure of the nla_memdup(), check the return value. Fixes: a442b761b24b ("cfg80211: add add_nan_func / del_nan_func") Signed-off-by: Jiasheng Jiang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johannes Berg <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Dave Jiang <[email protected]> Date: Thu Jan 27 13:31:12 2022 -0700 ntb: intel: fix port config status offset for SPR commit d5081bf5dcfb1cb83fb538708b0ac07a10a79cc4 upstream. The field offset for port configuration status on SPR has been changed to bit 14 from ICX where it resides at bit 12. By chance link status detection continued to work on SPR. This is due to bit 12 being a configuration bit which is in sync with the status bit. Fix this by checking for a SPR device and checking correct status bit. Fixes: 26bfe3d0b227 ("ntb: intel: Add Icelake (gen4) support for Intel NTB") Tested-by: Jerry Dai <[email protected]> Signed-off-by: Dave Jiang <[email protected]> Signed-off-by: Jon Mason <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Samuel Holland <[email protected]> Date: Tue Feb 15 22:00:36 2022 -0600 pinctrl: sunxi: Use unique lockdep classes for IRQs commit bac129dbc6560dfeb634c03f0c08b78024e71915 upstream. This driver, like several others, uses a chained IRQ for each GPIO bank, and forwards .irq_set_wake to the GPIO bank's upstream IRQ. As a result, a call to irq_set_irq_wake() needs to lock both the upstream and downstream irq_desc's. Lockdep considers this to be a possible deadlock when the irq_desc's share lockdep classes, which they do by default: ============================================ WARNING: possible recursive locking detected 5.17.0-rc3-00394-gc849047c2473 #1 Not tainted -------------------------------------------- init/307 is trying to acquire lock: c2dfe27c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0 but task is already holding lock: c3c0ac7c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&irq_desc_lock_class); lock(&irq_desc_lock_class); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by init/307: #0: c1f29f18 (system_transition_mutex){+.+.}-{3:3}, at: __do_sys_reboot+0x90/0x23c #1: c20f7760 (&dev->mutex){....}-{3:3}, at: device_shutdown+0xf4/0x224 #2: c2e804d8 (&dev->mutex){....}-{3:3}, at: device_shutdown+0x104/0x224 #3: c3c0ac7c (&irq_desc_lock_class){-.-.}-{2:2}, at: __irq_get_desc_lock+0x58/0xa0 stack backtrace: CPU: 0 PID: 307 Comm: init Not tainted 5.17.0-rc3-00394-gc849047c2473 #1 Hardware name: Allwinner sun8i Family unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x68/0x90 dump_stack_lvl from __lock_acquire+0x1680/0x31a0 __lock_acquire from lock_acquire+0x148/0x3dc lock_acquire from _raw_spin_lock_irqsave+0x50/0x6c _raw_spin_lock_irqsave from __irq_get_desc_lock+0x58/0xa0 __irq_get_desc_lock from irq_set_irq_wake+0x2c/0x19c irq_set_irq_wake from irq_set_irq_wake+0x13c/0x19c [tail call from sunxi_pinctrl_irq_set_wake] irq_set_irq_wake from gpio_keys_suspend+0x80/0x1a4 gpio_keys_suspend from gpio_keys_shutdown+0x10/0x2c gpio_keys_shutdown from device_shutdown+0x180/0x224 device_shutdown from __do_sys_reboot+0x134/0x23c __do_sys_reboot from ret_fast_syscall+0x0/0x1c However, this can never deadlock because the upstream and downstream IRQs are never the same (nor do they even involve the same irqchip). Silence this erroneous lockdep splat by applying what appears to be the usual fix of moving the GPIO IRQs to separate lockdep classes. Fixes: a59c99d9eaf9 ("pinctrl: sunxi: Forward calls to irq_set_irq_wake") Reported-by: Guenter Roeck <[email protected]> Signed-off-by: Samuel Holland <[email protected]> Reviewed-by: Jernej Skrabec <[email protected]> Tested-by: Guenter Roeck <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Linus Walleij <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Mario Limonciello <[email protected]> Date: Wed Feb 23 11:52:37 2022 -0600 platform/x86: amd-pmc: Set QOS during suspend on CZN w/ timer wakeup commit 68af28426b3ca1bf9ba21c7d8bdd0ff639e5134c upstream. commit 59348401ebed ("platform/x86: amd-pmc: Add special handling for timer based S0i3 wakeup") adds support for using another platform timer in lieu of the RTC which doesn't work properly on some systems. This path was validated and worked well before submission. During the 5.16-rc1 merge window other patches were merged that caused this to stop working properly. When this feature was used with 5.16-rc1 or later some OEM laptops with the matching firmware requirements from that commit would shutdown instead of program a timer based wakeup. This was bisected to commit 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path"). This wasn't supposed to cause any negative impacts and also tested well on both Intel and ARM platforms. However this changed the semantics of when CPUs are allowed to be in the deepest state. For the AMD systems in question it appears this causes a firmware crash for timer based wakeup. It's hypothesized to be caused by the `amd-pmc` driver sending `OS_HINT` and all the CPUs going into a deep state while the timer is still being programmed. It's likely a firmware bug, but to avoid it don't allow setting CPUs into the deepest state while using CZN timer wakeup path. If later it's discovered that this also occurs from "regular" suspends without a timer as well or on other silicon, this may be later expanded to run in the suspend path for more scenarios. Cc: [email protected] # 5.16+ Suggested-by: Rafael J. Wysocki <[email protected]> Link: https://lore.kernel.org/linux-acpi/BL1PR12MB51570F5BD05980A0DCA1F3F4E23A9@BL1PR12MB5157.namprd12.prod.outlook.com/T/#mee35f39c41a04b624700ab2621c795367f19c90e Fixes: 8d89835b0467 ("PM: suspend: Do not pause cpuidle in the suspend-to-idle path") Fixes: 23f62d7ab25b ("PM: sleep: Pause cpuidle later and resume it earlier during system transitions") Fixes: 59348401ebed ("platform/x86: amd-pmc: Add special handling for timer based S0i3 wakeup" Reviewed-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Mario Limonciello <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Yun Zhou <[email protected]> Date: Fri Mar 4 20:29:07 2022 -0800 proc: fix documentation and description of pagemap commit dd21bfa425c098b95ca86845f8e7d1ec1ddf6e4a upstream. Since bit 57 was exported for uffd-wp write-protected (commit fb8e37f35a2f: "mm/pagemap: export uffd-wp protection information"), fixing it can reduce some unnecessary confusion. Link: https://lkml.kernel.org/r/[email protected] Fixes: fb8e37f35a2fe1 ("mm/pagemap: export uffd-wp protection information") Signed-off-by: Yun Zhou <[email protected]> Reviewed-by: Peter Xu <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Tiberiu A Georgescu <[email protected]> Cc: Florian Schmidt <[email protected]> Cc: Ivan Teterevkov <[email protected]> Cc: SeongJae Park <[email protected]> Cc: Yang Shi <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Axel Rasmussen <[email protected]> Cc: Miaohe Lin <[email protected]> Cc: Andrea Arcangeli <[email protected]> Cc: Colin Cross <[email protected]> Cc: Alistair Popple <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Jonathan Lemon <[email protected]> Date: Mon Feb 28 12:39:57 2022 -0800 ptp: ocp: Add ptp_ocp_adjtime_coarse for large adjustments [ Upstream commit 90f8f4c0e3cebd541deaa45cf0e470bb9810dd4f ] In ("ptp: ocp: Have FPGA fold in ns adjustment for adjtime."), the ns adjustment was written to the FPGA register, so the clock could accurately perform adjustments. However, the adjtime() call passes in a s64, while the clock adjustment registers use a s32. When trying to perform adjustments with a large value (37 sec), things fail. Examine the incoming delta, and if larger than 1 sec, use the original (coarse) adjustment method. If smaller than 1 sec, then allow the FPGA to fold in the changes over a 1 second window. Fixes: 6d59d4fa1789 ("ptp: ocp: Have FPGA fold in ns adjustment for adjtime.") Signed-off-by: Jonathan Lemon <[email protected]> Acked-by: Richard Cochran <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Oliver Barta <[email protected]> Date: Tue Feb 8 09:46:45 2022 +0100 regulator: core: fix false positive in regulator_late_cleanup() [ Upstream commit 4e2a354e3775870ca823f1fb29bbbffbe11059a6 ] The check done by regulator_late_cleanup() to detect whether a regulator is on was inconsistent with the check done by _regulator_is_enabled(). While _regulator_is_enabled() takes the enable GPIO into account, regulator_late_cleanup() was not doing that. This resulted in a false positive, e.g. when a GPIO-controlled fixed regulator was used, which was not enabled at boot time, e.g. reg_disp_1v2: reg_disp_1v2 { compatible = "regulator-fixed"; regulator-name = "display_1v2"; regulator-min-microvolt = <1200000>; regulator-max-microvolt = <1200000>; gpio = <&tlmm 148 0>; enable-active-high; }; Such regulator doesn't have an is_enabled() operation. Nevertheless it's state can be determined based on the enable GPIO. The check in regulator_late_cleanup() wrongly assumed that the regulator is on and tried to disable it. Signed-off-by: Oliver Barta <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Mark Brown <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Jiri Bohac <[email protected]> Date: Wed Jan 26 16:00:18 2022 +0100 Revert "xfrm: xfrm_state_mtu should return at least 1280 for ipv6" commit a6d95c5a628a09be129f25d5663a7e9db8261f51 upstream. This reverts commit b515d2637276a3810d6595e10ab02c13bfd0b63a. Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm: xfrm_state_mtu should return at least 1280 for ipv6") in v5.14 breaks the TCP MSS calculation in ipsec transport mode, resulting complete stalls of TCP connections. This happens when the (P)MTU is 1280 or slighly larger. The desired formula for the MSS is: MSS = (MTU - ESP_overhead) - IP header - TCP header However, the above commit clamps the (MTU - ESP_overhead) to a minimum of 1280, turning the formula into MSS = max(MTU - ESP overhead, 1280) - IP header - TCP header With the (P)MTU near 1280, the calculated MSS is too large and the resulting TCP packets never make it to the destination because they are over the actual PMTU. The above commit also causes suboptimal double fragmentation in xfrm tunnel mode, as described in https://lore.kernel.org/netdev/[email protected]/ The original problem the above commit was trying to fix is now fixed by commit 6596a0229541270fb8d38d989f91b78838e5e9da ("xfrm: fix MTU regression"). Signed-off-by: Jiri Bohac <[email protected]> Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Sunil V L <[email protected]> Date: Fri Jan 28 10:20:04 2022 +0530 riscv/efi_stub: Fix get_boot_hartid_from_fdt() return value commit dcf0c838854c86e1f41fb1934aea906845d69782 upstream. The get_boot_hartid_from_fdt() function currently returns U32_MAX for failure case which is not correct because U32_MAX is a valid hartid value. This patch fixes the issue by returning error code. Cc: <[email protected]> Fixes: d7071743db31 ("RISC-V: Add EFI stub support.") Signed-off-by: Sunil V L <[email protected]> Reviewed-by: Heinrich Schuchardt <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Alexandre Ghiti <[email protected]> Date: Fri Feb 25 13:39:51 2022 +0100 riscv: Fix config KASAN && DEBUG_VIRTUAL commit c648c4bb7d02ceb53ee40172fdc4433b37cee9c6 upstream. __virt_to_phys function is called very early in the boot process (ie kasan_early_init) so it should not be instrumented by KASAN otherwise it bugs. Fix this by declaring phys_addr.c as non-kasan instrumentable. Signed-off-by: Alexandre Ghiti <[email protected]> Fixes: 8ad8b72721d0 (riscv: Add KASAN support) Cc: [email protected] Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Alexandre Ghiti <[email protected]> Date: Fri Feb 25 13:39:49 2022 +0100 riscv: Fix config KASAN && SPARSEMEM && !SPARSE_VMEMMAP commit a3d328037846d013bb4c7f3777241e190e4c75e1 upstream. In order to get the pfn of a struct page* when sparsemem is enabled without vmemmap, the mem_section structures need to be initialized which happens in sparse_init. But kasan_early_init calls pfn_to_page way before sparse_init is called, which then tries to dereference a null mem_section pointer. Fix this by removing the usage of this function in kasan_early_init. Fixes: 8ad8b72721d0 ("riscv: Add KASAN support") Signed-off-by: Alexandre Ghiti <[email protected]> Cc: [email protected] Signed-off-by: Palmer Dabbelt <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Heiko Carstens <[email protected]> Date: Thu Feb 24 22:03:29 2022 +0100 s390/extable: fix exception table sorting commit c194dad21025dfd043210912653baab823bdff67 upstream. s390 has a swap_ex_entry_fixup function, however it is not being used since common code expects a swap_ex_entry_fixup define. If it is not defined the default implementation will be used. So fix this by adding a proper define. However also the implementation of the function must be fixed, since a NULL value for handler has a special meaning and must not be adjusted. Luckily all of this doesn't fix a real bug currently: the main extable is correctly sorted during build time, and for runtime sorting there is currently no case where the handler field is not NULL. Fixes: 05a68e892e89 ("s390/kernel: expand exception table logic to allow new handling options") Acked-by: Ilya Leoshkevich <[email protected]> Reviewed-by: Alexander Gordeev <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Signed-off-by: Vasily Gorbik <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Heiko Carstens <[email protected]> Date: Tue Feb 22 14:53:47 2022 +0100 s390/ftrace: fix arch_ftrace_get_regs implementation commit 1389f17937a03fe4ec71b094e1aa6530a901963e upstream. arch_ftrace_get_regs is supposed to return a struct pt_regs pointer only if the pt_regs structure contains all register contents, which means it must have been populated when created via ftrace_regs_caller. If it was populated via ftrace_caller the contents are not complete (the psw mask part is missing), and therefore a NULL pointer needs be returned. The current code incorrectly always returns a struct pt_regs pointer. Fix this by adding another pt_regs flag which indicates if the contents are complete, and fix arch_ftrace_get_regs accordingly. Fixes: 894979689d3a ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations") Reported-by: Christophe Leroy <[email protected]> Reported-by: Naveen N. Rao <[email protected]> Reviewed-by: Sven Schnelle <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Signed-off-by: Vasily Gorbik <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Heiko Carstens <[email protected]> Date: Wed Feb 23 13:02:59 2022 +0100 s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation commit 9fa881f7e3c74ce6626d166bca9397e5d925937f upstream. ftrace_caller was used for both ftrace_caller and ftrace_regs_caller, which means that the target address of the hotpatch trampoline was never updated. With commit 894979689d3a ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations") a separate ftrace_regs_caller entry point was implemeted, however it was forgotten to implement the necessary changes for ftrace_modify_call and ftrace_make_call, where the branch target has to be modified accordingly. Therefore add the missing code now. Fixes: 894979689d3a ("s390/ftrace: provide separate ftrace_caller/ftrace_regs_caller implementations") Reviewed-by: Sven Schnelle <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Signed-off-by: Heiko Carstens <[email protected]> Signed-off-by: Vasily Gorbik <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Alexander Egorenkov <[email protected]> Date: Wed Feb 9 11:25:09 2022 +0100 s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE commit 6b4b54c7ca347bcb4aa7a3cc01aa16e84ac7fbe4 upstream. We need to preserve the values at OLDMEM_BASE and OLDMEM_SIZE which are used by zgetdump in case when kdump crashes. In that case zgetdump will attempt to read OLDMEM_BASE and OLDMEM_SIZE in order to find out where the memory range [0 - OLDMEM_SIZE] belonging to the production kernel is. Fixes: f1a546947431 ("s390/setup: don't reserve memory that occupied decompressor's head") Cc: [email protected] # 5.15+ Signed-off-by: Alexander Egorenkov <[email protected]> Acked-by: Vasily Gorbik <[email protected]> Signed-off-by: Vasily Gorbik <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Tadeusz Struk <[email protected]> Date: Thu Feb 3 08:18:46 2022 -0800 sched/fair: Fix fault in reweight_entity [ Upstream commit 13765de8148f71fa795e0a6607de37c49ea5915a ] Syzbot found a GPF in reweight_entity. This has been bisected to commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") There is a race between sched_post_fork() and setpriority(PRIO_PGRP) within a thread group that causes a null-ptr-deref in reweight_entity() in CFS. The scenario is that the main process spawns number of new threads, which then call setpriority(PRIO_PGRP, 0, -20), wait, and exit. For each of the new threads the copy_process() gets invoked, which adds the new task_struct and calls sched_post_fork() for it. In the above scenario there is a possibility that setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread in the group that is just being created by copy_process(), and for which the sched_post_fork() has not been executed yet. This will trigger a null pointer dereference in reweight_entity(), as it will try to access the run queue pointer, which hasn't been set. Before the mentioned change the cfs_rq pointer for the task has been set in sched_fork(), which is called much earlier in copy_process(), before the new task is added to the thread_group. Now it is done in the sched_post_fork(), which is called after that. To fix the issue the remove the update_load param from the update_load param() function and call reweight_task() only if the task flag doesn't have the TASK_NEW flag set. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: [email protected] Signed-off-by: Tadeusz Struk <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Dietmar Eggemann <[email protected]> Cc: [email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]>
Author: Peter Zijlstra <[email protected]> Date: Mon Feb 14 10:16:57 2022 +0100 sched: Fix yet more sched_fork() races commit b1e8206582f9d680cff7d04828708c8b6ab32957 upstream. Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") fixed a fork race vs cgroup, it opened up a race vs syscalls by not placing the task on the runqueue before it gets exposed through the pidhash. Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is trying to fix a single instance of this, instead fix the whole class of issues, effectively reverting this commit. Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group") Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Tadeusz Struk <[email protected]> Tested-by: Zhang Qiao <[email protected]> Tested-by: Dietmar Eggemann <[email protected]> Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Krzysztof Kozlowski <[email protected]> Date: Mon Feb 14 09:36:57 2022 +0100 selftests/ftrace: Do not trace do_softirq because of PREEMPT_RT [ Upstream commit 6fec1ab67f8d60704cc7de64abcfd389ab131542 ] The PREEMPT_RT patchset does not use do_softirq() function thus trying to filter for do_softirq fails for such kernel: echo do_softirq ftracetest: 81: echo: echo: I/O error Choose some other visible function for the test. The function does not have to be actually executed during the test, because it is only testing filter API interface. Signed-off-by: Krzysztof Kozlowski <[email protected]> Reviewed-by: Shuah Khan <[email protected]> Acked-by: Sebastian Andrzej Siewior <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Shuah Khan <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Sherry Yang <[email protected]> Date: Thu Feb 10 12:30:49 2022 -0800 selftests/seccomp: Fix seccomp failure by adding missing headers [ Upstream commit 21bffcb76ee2fbafc7d5946cef10abc9df5cfff7 ] seccomp_bpf failed on tests 47 global.user_notification_filter_empty and 48 global.user_notification_filter_empty_threaded when it's tested on updated kernel but with old kernel headers. Because old kernel headers don't have definition of macro __NR_clone3 which is required for these two tests. Since under selftests/, we can install headers once for all tests (the default INSTALL_HDR_PATH is usr/include), fix it by adding usr/include to the list of directories to be searched. Use "-isystem" to indicate it's a system directory as the real kernel headers directories are. Signed-off-by: Sherry Yang <[email protected]> Tested-by: Sherry Yang <[email protected]> Reviewed-by: Kees Cook <[email protected]> Signed-off-by: Shuah Khan <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Amit Cohen <[email protected]> Date: Wed Mar 2 18:14:47 2022 +0200 selftests: mlxsw: resource_scale: Fix return value [ Upstream commit 196f9bc050cbc5085b4cbb61cce2efe380bc66d0 ] The test runs several test cases and is supposed to return an error in case at least one of them failed. Currently, the check of the return value of each test case is in the wrong place, which can result in the wrong return value. For example: # TESTS='tc_police' ./resource_scale.sh TEST: 'tc_police' [default] 968 [FAIL] tc police offload count failed Error: mlxsw_spectrum: Failed to allocate policer index. We have an error talking to the kernel Command failed /tmp/tmp.i7Oc5HwmXY:969 TEST: 'tc_police' [default] overflow 969 [ OK ] ... TEST: 'tc_police' [ipv4_max] overflow 969 [ OK ] $ echo $? 0 Fix this by moving the check to be done after each test case. Fixes: 059b18e21c63 ("selftests: mlxsw: Return correct error code in resource scale test") Signed-off-by: Amit Cohen <[email protected]> Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Amit Cohen <[email protected]> Date: Wed Mar 2 18:14:46 2022 +0200 selftests: mlxsw: tc_police_scale: Make test more robust commit dc9752075341e7beb653e37c6f4a3723074dc8bc upstream. The test adds tc filters and checks how many of them were offloaded by grepping for 'in_hw'. iproute2 commit f4cd4f127047 ("tc: add skip_hw and skip_sw to control action offload") added offload indication to tc actions, producing the following output: $ tc filter show dev swp2 ingress ... filter protocol ipv6 pref 1000 flower chain 0 handle 0x7c0 eth_type ipv6 dst_ip 2001:db8:1::7bf skip_sw in_hw in_hw_count 1 action order 1: police 0x7c0 rate 10Mbit burst 100Kb mtu 2Kb action drop overhead 0b ref 1 bind 1 not_in_hw used_hw_stats immediate The current grep expression matches on both 'in_hw' and 'not_in_hw', resulting in incorrect results. Fix that by using JSON output instead. Fixes: 5061e773264b ("selftests: mlxsw: Add scale test for tc-police") Signed-off-by: Amit Cohen <[email protected]> Reviewed-by: Petr Machata <[email protected]> Signed-off-by: Ido Schimmel <[email protected]> Signed-off-by: Jakub Kicinski <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Valentin Caron <[email protected]> Date: Tue Jan 11 17:44:40 2022 +0100 serial: stm32: prevent TDR register overwrite when sending x_char [ Upstream commit d3d079bde07e1b7deaeb57506dc0b86010121d17 ] When sending x_char in stm32_usart_transmit_chars(), driver can overwrite the value of TDR register by the value of x_char. If this happens, the previous value that was present in TDR register will not be sent through uart. This code checks if the previous value in TDR register is sent before writing the x_char value into register. Fixes: 48a6092fb41f ("serial: stm32-usart: Add STM32 USART Driver") Cc: stable <[email protected]> Signed-off-by: Valentin Caron <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Christophe JAILLET <[email protected]> Date: Wed Nov 3 21:00:33 2021 +0100 soc: fsl: guts: Add a missing memory allocation failure check [ Upstream commit b9abe942cda43a1d46a0fd96efb54f1aa909f757 ] If 'devm_kstrdup()' fails, we should return -ENOMEM. While at it, move the 'of_node_put()' call in the error handling path and after the 'machine' has been copied. Better safe than sorry. Fixes: a6fc3b698130 ("soc: fsl: add GUTS driver for QorIQ platforms") Depends-on: fddacc7ff4dd ("soc: fsl: guts: Revert commit 3c0d64e867ed") Suggested-by: Tyrel Datwyler <[email protected]> Signed-off-by: Christophe JAILLET <[email protected]> Signed-off-by: Li Yang <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Christophe JAILLET <[email protected]> Date: Wed Nov 3 21:00:17 2021 +0100 soc: fsl: guts: Revert commit 3c0d64e867ed [ Upstream commit b113737cf12964a20cc3ba1ddabe6229099661c6 ] This reverts commit 3c0d64e867ed ("soc: fsl: guts: reuse machine name from device tree"). A following patch will fix the missing memory allocation failure check instead. Suggested-by: Tyrel Datwyler <[email protected]> Signed-off-by: Christophe JAILLET <[email protected]> Signed-off-by: Li Yang <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Jiasheng Jiang <[email protected]> Date: Thu Dec 30 09:45:43 2021 +0800 soc: fsl: qe: Check of ioremap return value [ Upstream commit a222fd8541394b36b13c89d1698d9530afd59a9c ] As the possible failure of the ioremap(), the par_io could be NULL. Therefore it should be better to check it and return error in order to guarantee the success of the initiation. But, I also notice that all the caller like mpc85xx_qe_par_io_init() in `arch/powerpc/platforms/85xx/common.c` don't check the return value of the par_io_init(). Actually, par_io_init() needs to check to handle the potential error. I will submit another patch to fix that. Anyway, par_io_init() itsely should be fixed. Fixes: 7aa1aa6ecec2 ("QE: Move QE from arch/powerpc to drivers/soc") Signed-off-by: Jiasheng Jiang <[email protected]> Signed-off-by: Li Yang <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Laurent Pinchart <[email protected]> Date: Fri Feb 18 23:57:20 2022 +0200 soc: imx: gpcv2: Fix clock disabling imbalance in error path [ Upstream commit fa231bef3b34f1670b240409c11e59a3ce095e6d ] The imx_pgc_power_down() starts by enabling the domain clocks, and thus disables them in the error path. Commit 18c98573a4cf ("soc: imx: gpcv2: add domain option to keep domain clocks enabled") made the clock enable conditional, but forgot to add the same condition to the error path. This can result in a clock enable/disable imbalance. Fix it. Fixes: 18c98573a4cf ("soc: imx: gpcv2: add domain option to keep domain clocks enabled") Signed-off-by: Laurent Pinchart <[email protected]> Reviewed-by: Lucas Stach <[email protected]> Signed-off-by: Shawn Guo <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Nicolas Cavallari <[email protected]> Date: Mon Feb 28 12:03:51 2022 +0100 thermal: core: Fix TZ_GET_TRIP NULL pointer dereference commit 5838a14832d447990827d85e90afe17e6fb9c175 upstream. Do not call get_trip_hyst() from thermal_genl_cmd_tz_get_trip() if the thermal zone does not define one. Fixes: 1ce50e7d408e ("thermal: core: genetlink support for events/cmd/sampling") Signed-off-by: Nicolas Cavallari <[email protected]> Cc: 5.10+ <[email protected]> # 5.10+ Signed-off-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Hangyu Hua <[email protected]> Date: Fri Feb 11 12:55:10 2022 +0800 tipc: fix a bit overflow in tipc_crypto_key_rcv() [ Upstream commit 143de8d97d79316590475dc2a84513c63c863ddf ] msg_data_sz return a 32bit value, but size is 16bit. This may lead to a bit overflow. Signed-off-by: Hangyu Hua <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Steven Rostedt (Google) <[email protected]> Date: Tue Mar 1 22:29:04 2022 -0500 tracing/histogram: Fix sorting on old "cpu" value commit 1d1898f65616c4601208963c3376c1d828cbf2c7 upstream. When trying to add a histogram against an event with the "cpu" field, it was impossible due to "cpu" being a keyword to key off of the running CPU. So to fix this, it was changed to "common_cpu" to match the other generic fields (like "common_pid"). But since some scripts used "cpu" for keying off of the CPU (for events that did not have "cpu" as a field, which is most of them), a backward compatibility trick was added such that if "cpu" was used as a key, and the event did not have "cpu" as a field name, then it would fallback and switch over to "common_cpu". This fix has a couple of subtle bugs. One was that when switching over to "common_cpu", it did not change the field name, it just set a flag. But the code still found a "cpu" field. The "cpu" field is used for filtering and is returned when the event does not have a "cpu" field. This was found by: # cd /sys/kernel/tracing # echo hist:key=cpu,pid:sort=cpu > events/sched/sched_wakeup/trigger # cat events/sched/sched_wakeup/hist Which showed the histogram unsorted: { cpu: 19, pid: 1175 } hitcount: 1 { cpu: 6, pid: 239 } hitcount: 2 { cpu: 23, pid: 1186 } hitcount: 14 { cpu: 12, pid: 249 } hitcount: 2 { cpu: 3, pid: 994 } hitcount: 5 Instead of hard coding the "cpu" checks, take advantage of the fact that trace_event_field_field() returns a special field for "cpu" and "CPU" if the event does not have "cpu" as a field. This special field has the "filter_type" of "FILTER_CPU". Check that to test if the returned field is of the CPU type instead of doing the string compare. Also, fix the sorting bug by testing for the hist_field flag of HIST_FIELD_FL_CPU when setting up the sort routine. Otherwise it will use the special CPU field to know what compare routine to use, and since that special field does not have a size, it returns tracing_map_cmp_none. Cc: [email protected] Fixes: 1e3bac71c505 ("tracing/histogram: Rename "cpu" to "common_cpu"") Reported-by: Daniel Bristot de Oliveira <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Steven Rostedt <[email protected]> Date: Mon Jan 10 11:55:32 2022 -0500 tracing: Add test for user space strings when filtering on string pointers [ Upstream commit 77360f9bbc7e5e2ab7a2c8b4c0244fbbfcfc6f62 ] Pingfan reported that the following causes a fault: echo "filename ~ \"cpu\"" > events/syscalls/sys_enter_openat/filter echo 1 > events/syscalls/sys_enter_at/enable The reason is that trace event filter treats the user space pointer defined by "filename" as a normal pointer to compare against the "cpu" string. The following bug happened: kvm-03-guest16 login: [72198.026181] BUG: unable to handle page fault for address: 00007fffaae8ef60 #PF: supervisor read access in kernel mode #PF: error_code(0x0001) - permissions violation PGD 80000001008b7067 P4D 80000001008b7067 PUD 2393f1067 PMD 2393ec067 PTE 8000000108f47867 Oops: 0001 [#1] PREEMPT SMP PTI CPU: 1 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.14.0-32.el9.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:strlen+0x0/0x20 Code: 48 89 f9 74 09 48 83 c1 01 80 39 00 75 f7 31 d2 44 0f b6 04 16 44 88 04 11 48 83 c2 01 45 84 c0 75 ee c3 0f 1f 80 00 00 00 00 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31 RSP: 0018:ffffb5b900013e48 EFLAGS: 00010246 RAX: 0000000000000018 RBX: ffff8fc1c49ede00 RCX: 0000000000000000 RDX: 0000000000000020 RSI: ffff8fc1c02d601c RDI: 00007fffaae8ef60 RBP: 00007fffaae8ef60 R08: 0005034f4ddb8ea4 R09: 0000000000000000 R10: ffff8fc1c02d601c R11: 0000000000000000 R12: ffff8fc1c8a6e380 R13: 0000000000000000 R14: ffff8fc1c02d6010 R15: ffff8fc1c00453c0 FS: 00007fa86123db40(0000) GS:ffff8fc2ffd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffaae8ef60 CR3: 0000000102880001 CR4: 00000000007706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: filter_pred_pchar+0x18/0x40 filter_match_preds+0x31/0x70 ftrace_syscall_enter+0x27a/0x2c0 syscall_trace_enter.constprop.0+0x1aa/0x1d0 do_syscall_64+0x16/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fa861d88664 The above happened because the kernel tried to access user space directly and triggered a "supervisor read access in kernel mode" fault. Worse yet, the memory could not even be loaded yet, and a SEGFAULT could happen as well. This could be true for kernel space accessing as well. To be even more robust, test both kernel and user space strings. If the string fails to read, then simply have the filter fail. Note, TASK_SIZE is used to determine if the pointer is user or kernel space and the appropriate strncpy_from_kernel/user_nofault() function is used to copy the memory. For some architectures, the compare to TASK_SIZE may always pick user space or kernel space. If it gets it wrong, the only thing is that the filter will fail to match. In the future, this needs to be fixed to have the event denote which should be used. But failing a filter is much better than panicing the machine, and that can be solved later. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Cc: Ingo Molnar <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Masami Hiramatsu <[email protected]> Cc: Tom Zanussi <[email protected]> Reported-by: Pingfan Liu <[email protected]> Tested-by: Pingfan Liu <[email protected]> Fixes: 87a342f5db69d ("tracing/filters: Support filtering for char * strings") Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Steven Rostedt <[email protected]> Date: Thu Jan 13 20:08:40 2022 -0500 tracing: Add ustring operation to filtering string pointers [ Upstream commit f37c3bbc635994eda203a6da4ba0f9d05165a8d6 ] Since referencing user space pointers is special, if the user wants to filter on a field that is a pointer to user space, then they need to specify it. Add a ".ustring" attribute to the field name for filters to state that the field is pointing to user space such that the kernel can take the appropriate action to read that pointer. Link: https://lore.kernel.org/all/[email protected]/ Fixes: 77360f9bbc7e ("tracing: Add test for user space strings when filtering on string pointers") Tested-by: Sven Schnelle <[email protected]> Signed-off-by: Steven Rostedt <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Author: Randy Dunlap <[email protected]> Date: Wed Mar 2 19:17:44 2022 -0800 tracing: Fix return value of __setup handlers commit 1d02b444b8d1345ea4708db3bab4db89a7784b55 upstream. __setup() handlers should generally return 1 to indicate that the boot options have been handled. Using invalid option values causes the entire kernel boot option string to be reported as Unknown and added to init's environment strings, polluting it. Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6 kprobe_event=p,syscall_any,$arg1 trace_options=quiet trace_clock=jiffies", will be passed to user space. Run /sbin/init as init process with arguments: /sbin/init with environment: HOME=/ TERM=linux BOOT_IMAGE=/boot/bzImage-517rc6 kprobe_event=p,syscall_any,$arg1 trace_options=quiet trace_clock=jiffies Return 1 from the __setup() handlers so that init's environment is not polluted with kernel boot options. Link: lore.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 7bcfaf54f591 ("tracing: Add trace_options kernel command line parameter") Fixes: e1e232ca6b8f ("tracing: Add trace_clock=<clock> kernel parameter") Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") Signed-off-by: Randy Dunlap <[email protected]> Reported-by: Igor Zhbanov <[email protected]> Acked-by: Masami Hiramatsu <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Eric W. Biederman <[email protected]> Date: Thu Feb 24 08:32:28 2022 -0600 ucounts: Fix systemd LimitNPROC with private users regression commit 0ac983f512033cb7b5e210c9589768ad25b1e36b upstream. Long story short recursively enforcing RLIMIT_NPROC when it is not enforced on the process that creates a new user namespace, causes currently working code to fail. There is no reason to enforce RLIMIT_NPROC recursively when we don't enforce it normally so update the code to detect this case. I would like to simply use capable(CAP_SYS_RESOURCE) to detect when RLIMIT_NPROC is not enforced upon the caller. Unfortunately because RLIMIT_NPROC is charged and checked for enforcement based upon the real uid, using capable() which is euid based is inconsistent with reality. Come as close as possible to testing for capable(CAP_SYS_RESOURCE) by testing for when the real uid would match the conditions when CAP_SYS_RESOURCE would be present if the real uid was the effective uid. Reported-by: Etienne Dechamps <[email protected]> Link: https://bugzilla.kernel.org/show_bug.cgi?id=215596 Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Cc: [email protected] Fixes: 21d1c5e386bc ("Reimplement RLIMIT_NPROC on top of ucounts") Reviewed-by: Kees Cook <[email protected]> Signed-off-by: "Eric W. Biederman" <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Hangyu Hua <[email protected]> Date: Sat Jan 1 01:21:38 2022 +0800 usb: gadget: clear related members when goto fail commit 501e38a5531efbd77d5c73c0ba838a889bfc1d74 upstream. dev->config and dev->hs_config and dev->dev need to be cleaned if dev_config fails to avoid UAF. Acked-by: Alan Stern <[email protected]> Signed-off-by: Hangyu Hua <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Hangyu Hua <[email protected]> Date: Sat Jan 1 01:21:37 2022 +0800 usb: gadget: don't release an existing dev->buf commit 89f3594d0de58e8a57d92d497dea9fee3d4b9cda upstream. dev->buf does not need to be released if it already exists before executing dev_config. Acked-by: Alan Stern <[email protected]> Signed-off-by: Hangyu Hua <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Dexuan Cui <[email protected]> Date: Fri Feb 25 00:46:00 2022 -0800 x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64 commit 92e68cc558774de01024c18e8b35cdce4731c910 upstream. When Linux runs as an Isolated VM on Hyper-V, it supports AMD SEV-SNP but it's partially enlightened, i.e. cc_platform_has( CC_ATTR_GUEST_MEM_ENCRYPT) is true but sev_active() is false. Commit 4d96f9109109 per se is good, but with it now kvm_setup_vsyscall_timeinfo() -> kvmclock_init_mem() calls set_memory_decrypted(), and later gets stuck when trying to zere out the pages pointed by 'hvclock_mem', if Linux runs as an Isolated VM on Hyper-V. The cause is that here now the Linux VM should no longer access the original guest physical addrss (GPA); instead the VM should do memremap() and access the original GPA + ms_hyperv.shared_gpa_boundary: see the example code in drivers/hv/connection.c: vmbus_connect() or drivers/hv/ring_buffer.c: hv_ringbuffer_init(). If the VM tries to access the original GPA, it keepts getting injected a fault by Hyper-V and gets stuck there. Here the issue happens only when the VM has >=65 vCPUs, because the global static array hv_clock_boot[] can hold 64 "struct pvclock_vsyscall_time_info" (the sizeof of the struct is 64 bytes), so kvmclock_init_mem() only allocates memory in the case of vCPUs > 64. Since the 'hvclock_mem' pages are only useful when the kvm clock is supported by the underlying hypervisor, fix the issue by returning early when Linux VM runs on Hyper-V, which doesn't support kvm clock. Fixes: 4d96f9109109 ("x86/sev: Replace occurrences of sev_active() with cc_platform_has()") Tested-by: Andrea Parri (Microsoft) <[email protected]> Signed-off-by: Andrea Parri (Microsoft) <[email protected]> Signed-off-by: Dexuan Cui <[email protected]> Message-Id: <[email protected]> Signed-off-by: Paolo Bonzini <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Marek Marczykowski-Górecki <[email protected]> Date: Wed Feb 23 22:19:54 2022 +0100 xen/netfront: destroy queues before real_num_tx_queues is zeroed commit dcf4ff7a48e7598e6b10126cc02177abb8ae4f3f upstream. xennet_destroy_queues() relies on info->netdev->real_num_tx_queues to delete queues. Since d7dac083414eb5bb99a6d2ed53dc2c1b405224e5 ("net-sysfs: update the queue counts in the unregistration path"), unregister_netdev() indirectly sets real_num_tx_queues to 0. Those two facts together means, that xennet_destroy_queues() called from xennet_remove() cannot do its job, because it's called after unregister_netdev(). This results in kfree-ing queues that are still linked in napi, which ultimately crashes: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 52 Comm: xenwatch Tainted: G W 5.16.10-1.32.fc32.qubes.x86_64+ #226 RIP: 0010:free_netdev+0xa3/0x1a0 Code: ff 48 89 df e8 2e e9 00 00 48 8b 43 50 48 8b 08 48 8d b8 a0 fe ff ff 48 8d a9 a0 fe ff ff 49 39 c4 75 26 eb 47 e8 ed c1 66 ff <48> 8b 85 60 01 00 00 48 8d 95 60 01 00 00 48 89 ef 48 2d 60 01 00 RSP: 0000:ffffc90000bcfd00 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff88800edad000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffc90000bcfc30 RDI: 00000000ffffffff RBP: fffffffffffffea0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff88800edad050 R13: ffff8880065f8f88 R14: 0000000000000000 R15: ffff8880066c6680 FS: 0000000000000000(0000) GS:ffff8880f3300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 00000000e998c006 CR4: 00000000003706e0 Call Trace: <TASK> xennet_remove+0x13d/0x300 [xen_netfront] xenbus_dev_remove+0x6d/0xf0 __device_release_driver+0x17a/0x240 device_release_driver+0x24/0x30 bus_remove_device+0xd8/0x140 device_del+0x18b/0x410 ? _raw_spin_unlock+0x16/0x30 ? klist_iter_exit+0x14/0x20 ? xenbus_dev_request_and_reply+0x80/0x80 device_unregister+0x13/0x60 xenbus_dev_changed+0x18e/0x1f0 xenwatch_thread+0xc0/0x1a0 ? do_wait_intr_irq+0xa0/0xa0 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Fix this by calling xennet_destroy_queues() from xennet_uninit(), when real_num_tx_queues is still available. This ensures that queues are destroyed when real_num_tx_queues is set to 0, regardless of how unregister_netdev() was called. Originally reported at https://github.com/QubesOS/qubes-issues/issues/7257 Fixes: d7dac083414eb5bb9 ("net-sysfs: update the queue counts in the unregistration path") Cc: [email protected] Signed-off-by: Marek Marczykowski-Górecki <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Leon Romanovsky <[email protected]> Date: Tue Feb 8 16:14:32 2022 +0200 xfrm: enforce validity of offload input flags commit 7c76ecd9c99b6e9a771d813ab1aa7fa428b3ade1 upstream. struct xfrm_user_offload has flags variable that received user input, but kernel didn't check if valid bits were provided. It caused a situation where not sanitized input was forwarded directly to the drivers. For example, XFRM_OFFLOAD_IPV6 define that was exposed, was used by strongswan, but not implemented in the kernel at all. As a solution, check and sanitize input flags to forward XFRM_OFFLOAD_INBOUND to the drivers. Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Jiri Bohac <[email protected]> Date: Wed Jan 19 10:22:53 2022 +0100 xfrm: fix MTU regression commit 6596a0229541270fb8d38d989f91b78838e5e9da upstream. Commit 749439bfac6e1a2932c582e2699f91d329658196 ("ipv6: fix udpv6 sendmsg crash caused by too small MTU") breaks PMTU for xfrm. A Packet Too Big ICMPv6 message received in response to an ESP packet will prevent all further communication through the tunnel if the reported MTU minus the ESP overhead is smaller than 1280. E.g. in a case of a tunnel-mode ESP with sha256/aes the overhead is 92 bytes. Receiving a PTB with MTU of 1371 or less will result in all further packets in the tunnel dropped. A ping through the tunnel fails with "ping: sendmsg: Invalid argument". Apparently the MTU on the xfrm route is smaller than 1280 and fails the check inside ip6_setup_cork() added by 749439bf. We found this by debugging USGv6/ipv6ready failures. Failing tests are: "Phase-2 Interoperability Test Scenario IPsec" / 5.3.11 and 5.4.11 (Tunnel Mode: Fragmentation). Commit b515d2637276a3810d6595e10ab02c13bfd0b63a ("xfrm: xfrm_state_mtu should return at least 1280 for ipv6") attempted to fix this but caused another regression in TCP MSS calculations and had to be reverted. The patch below fixes the situation by dropping the MTU check and instead checking for the underflows described in the 749439bf commit message. Signed-off-by: Jiri Bohac <[email protected]> Fixes: 749439bfac6e ("ipv6: fix udpv6 sendmsg crash caused by too small MTU") Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
Author: Antony Antony <[email protected]> Date: Tue Feb 1 07:51:57 2022 +0100 xfrm: fix the if_id check in changelink commit 6d0d95a1c2b07270870e7be16575c513c29af3f1 upstream. if_id will be always 0, because it was not yet initialized. Fixes: 8dce43919566 ("xfrm: interface with if_id 0 should return error") Reported-by: Pavel Machek <[email protected]> Signed-off-by: Antony Antony <[email protected]> Signed-off-by: Steffen Klassert <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>