Add unwinding from irq context tests. Unwinder should be able to unwind
through irq stack to task stack up to task pt_regs.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add stack name, sp and reliable information into test unwinding
results. Also consider ip outside of kernel text as failure if the
state is reported reliable.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add CALL_ON_STACK helper testing. Tests make sure that we can unwind from
switched stack to original one up to task pt_regs (nodat -> task stack).
UWM_SWITCH_STACK could not be used together with UWM_THREAD because
get_stack_info explicitly restricts unwinding to task stack if
task != current.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
CALL_ON_STACK defines and initializes register variables. Inline
assembly which follows might trigger compiler to generate memory access
for "stack" argument (e.g. in case of S390_lowcore.nodat_stack). This
memory access produces a function call under kasan with outline
instrumentation which clobbers registers.
Switch "stack" argument in CALL_ON_STACK helper to use memory reference
constraint and perform load instead.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently unwinder test passes if unwinding results contain unwindme_func2
and unwindme_func1 functions.
Now that unwinder reports success upon reaching task pt_regs, check
that unwinding ended successfully in every test.
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
unwind_for_each_frame can take at least 8 different sets of parameters.
Add a test to make sure they all are handled in a sane way.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Co-developed-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Always inline get_stack_pointer() to avoid potential problems
due to compiler inlining decisions, i.e. getting stack pointer of
get_stack_pointer() itself which is later reused.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The config option CONFIG_PCI_NR_FUNCTIONS sets a limit on the number of
PCI functions we can support. Previously on reaching this limit there
was no indication why newly attached devices are not recognized by Linux
which could be quite confusing. Thus this patch adds a pr_err() for this
case.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When UID checking was turned off during runtime in the underlying
hypervisor, a PCI device may be attached with the same UID. This is
already detected but happens silently. Add an error message so it can
more easily be understood why a device was not added.
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Each SBDT is located at a 4KB page and contains 512 entries.
Each entry of a SDBT points to a SDB, a 4KB page containing
sampled data. The last entry is a link to another SDBT page.
When an event is created the function sequence executed is:
__hw_perf_event_init()
+--> allocate_buffers()
+--> realloc_sampling_buffers()
+---> alloc_sample_data_block()
Both functions realloc_sampling_buffers() and
alloc_sample_data_block() allocate pages and the allocation
can fail. This is handled correctly and all allocated
pages are freed and error -ENOMEM is returned to the
top calling function. Finally the event is not created.
Once the event has been created, the amount of initially
allocated SDBT and SDB can be too low. This is detected
during measurement interrupt handling, where the amount
of lost samples is calculated. If the number of lost samples
is too high considering sampling frequency and already allocated
SBDs, the number of SDBs is enlarged during the next execution
of cpumsf_pmu_enable().
If more SBDs need to be allocated, functions
realloc_sampling_buffers()
+---> alloc-sample_data_block()
are called to allocate more pages. Page allocation may fail
and the returned error is ignored. A SDBT and SDB setup
already exists.
However the modified SDBTs and SDBs might end up in a situation
where the first entry of an SDBT does not point to an SDB,
but another SDBT, basicly an SBDT without payload.
This can not be handled by the interrupt handler, where an SDBT
must have at least one entry pointing to an SBD.
Add a check to avoid SDBTs with out payload (SDBs) when enlarging
the buffer setup.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The macro TEAR_REG() saves the last used SDBT address
in the perf_hw_event structure. This is also done
by function hw_reset_registers() which is a one-liner
and simply uses macro TEAR_REG(). Remove function
hw_reset_registers(), which is only used one time and use
macro TEAR_REG() instead. This macro is used throughout
the code anyway.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In interrupt handling the function extend_sampling_buffer()
is called after checking for a possibly extension.
This check is not necessary as the called function itself
performs this check again.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Replace hard coded function names in debug statements
by the "%s ...", __func__ construct suggested by checkpatch.pl
script. Use consistent debug print format of the form variable
blank value. Also add leading 0x for all hex values.
Print allocated page addresses consistantly as hex numbers
with leading 0x.
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The KASLR offset is added to vmcoreinfo in arch_crash_save_vmcoreinfo(),
so that it can be found by crash when processing kernel dumps.
However, arch_crash_save_vmcoreinfo() is called during a subsys_initcall,
so if the kernel crashes before that, we have no vmcoreinfo and no KASLR
offset.
Fix this by storing the KASLR offset in the lowcore, where the vmcore_info
pointer will be stored, and where it can be found by crash. In order to
make it distinguishable from a real vmcore_info pointer, mark it as uneven
(KASLR offset itself is aligned to THREAD_SIZE).
When arch_crash_save_vmcoreinfo() stores the real vmcore_info pointer in
the lowcore, it overwrites the KASLR offset. At that point, the KASLR
offset is not yet added to vmcoreinfo, so we also need to move the
mem_assign_absolute() behind the vmcoreinfo_append_str().
Fixes: b2d24b97b2 ("s390/kernel: add support for kernel address space layout randomization (KASLR)")
Cc: <stable@vger.kernel.org> # v5.2+
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Consider reaching task pt_regs graceful unwinder termination. Task
pt_regs itself never contains a valid state to which a task might return
within the kernel context (user task pt_regs is a special case). Since
we already avoid printing user task pt_regs and in most cases we don't
even bother filling task pt_regs psw and r15 with something reasonable
simply skip task pt_regs altogether. With this change unwind_error() now
accurately represent whether unwinder reached task pt_regs successfully
or failed along the way.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add missing allocation of pt_regs at the bottom of the stack. This
makes it consistent with other stack setup cases and also what stack
unwinder expects.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently unwinder yields 2 entries when pt_regs are met:
sp="address of pt_regs itself" ip=pt_regs->psw
sp=pt_regs->gprs[15] ip="r14 from stack frame pointed by pt_regs->gprs[15]"
And neither of those 2 states (combination of sp and ip) ever happened.
reuse_sp has been introduced by commit a1d863ac3e ("s390/unwind: fix
mixing regs and sp"). reuse_sp=true makes unwinder keen to produce the
following result, when pt_regs are given (as an arg to unwind_start):
sp=pt_regs->gprs[15] ip=pt_regs->psw
sp=pt_regs->gprs[15] ip="r14 from stack frame pointed by pt_regs->gprs[15]"
The first state is an actual state in which a task was when pt_regs were
collected. The second state is marked unreliable and is for debugging
purposes to cover the case when a task has been interrupted in between
stack frame allocation and writing back_chain - in this case r14 might
show an actual caller.
Make unwinder behaviour enabled via reuse_sp=true default and drop the
special case handling.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
If unwinder is looking at pt_regs which is not on stack then something
went wrong and an error has to be reported rather than successful
unwinding termination.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
CALL_ON_STACK is intended to be used for temporary stack switching with
potential return to the caller.
When CALL_ON_STACK is misused to switch from nodat stack to task stack
back_chain information would later lead stack unwinder from task stack into
(per cpu) nodat stack which is reused for other purposes. This would
yield confusing unwinding result or errors.
To avoid that introduce CALL_ON_STACK_NORETURN to be used instead. It
makes sure that back_chain is zeroed and unwinder finishes gracefully
ending up at task pt_regs.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Currently CALL_ON_STACK saves r15 as back_chain in the first stack frame of
the stack we about to switch to. But if a function which uses CALL_ON_STACK
calls other function it allocates a stack frame for a callee. In this
case r15 is pointing to a callee stack frame and not a stack frame of
function itself. This results in dummy unwinding entry with random
sp and ip values.
Introduce and utilize current_frame_address macro to get an address of
actual function stack frame.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Avoid mixture of task == NULL and task == current meaning the same
thing and simply always initialize task with current in unwind_start.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make sure preemption is disabled when temporary switching to nodat
stack with CALL_ON_STACK helper, because nodat stack is per cpu.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
disabled_wait uses _THIS_IP_ and assumes that compiler would inline it.
Make sure this assumption is always correct by utilizing __always_inline.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
getcpu reads the required values for cpu and node with two
instructions. This might lead to an inconsistent result if user space
gets preempted and migrated to a different CPU between the two
instructions.
Fix this by using just a single instruction to read both values at
once.
This is currently rather a theoretical bug, since there is no real
NUMA support available (except for NUMA emulation).
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When a secondary CPU is brought up it must initialize its control
registers. CPU A which triggers that a secondary CPU B is brought up
stores its control register contents into the lowcore of new CPU B,
which then loads these values on startup.
This is problematic in various ways: the control register which
contains the home space ASCE will correctly contain the kernel ASCE;
however control registers for primary and secondary ASCEs are
initialized with whatever values were present in CPU A.
Typically:
- the primary ASCE will contain the user process ASCE of the process
that triggered onlining of CPU B.
- the secondary ASCE will contain the percpu VDSO ASCE of CPU A.
Due to lazy ASCE handling we may also end up with other combinations.
When then CPU B switches to a different process (!= idle) it will
fixup the primary ASCE. However the problem is that the (wrong) ASCE
from CPU A was loaded into control register 1: as soon as an ASCE is
attached (aka loaded) a CPU is free to generate TLB entries using that
address space.
Even though it is very unlikey that CPU B will actually generate such
entries, this could result in TLB entries of the address space of the
process that ran on CPU A. These entries shouldn't exist at all and
could cause problems later on.
Furthermore the secondary ASCE of CPU B will not be updated correctly.
This means that processes may see wrong results or even crash if they
access VDSO data on CPU B. The correct VDSO ASCE will eventually be
loaded on return to user space as soon as the kernel executed a call
to strnlen_user or an atomic futex operation on CPU B.
Fix both issues by intializing the to be loaded control register
contents with the correct ASCEs and also enforce (re-)loading of the
ASCEs upon first context switch and return to user space.
Fixes: 0aaba41b58 ("s390: remove all code using the access register mode")
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This patch introduces support for a new architectured reply
code 0x8B indicating that a hypervisor layer (if any) has
rejected an ap message.
Linux may run as a guest on top of a hypervisor like zVM
or KVM. So the crypto hardware seen by the ap bus may be
restricted by the hypervisor for example only a subset like
only clear key crypto requests may be supported. Other
requests will be filtered out - rejected by the hypervisor.
The new reply code 0x8B will appear in such cases and needs
to get recognized by the ap bus and zcrypt device driver zoo.
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
On s390 bpf_get_stack_raw_tp() returns 0 entries for both kernel and
user stacks. While there is no practical unwinding solution for userspace
on s390 at this moment, there certainly is a kernel unwinder. However,
it is not properly integrated with BPF.
In order to start unwinding, bpf_get_stack_raw_tp() obtains the current
kernel register values using perf_fetch_caller_regs(), which is not
implemented for s390. The actual unwinding then happens by passing those
registers to perf_callchain_kernel().
Implement perf_arch_fetch_caller_regs() for s390, where
__builtin_frame_address(0) points to back_chain.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
system_call assembly code always pushes pointer to struct pt_regs as the
last additional parameter for all system calls. The only user of this
feature is xtensa_rt_sigreturn.
Avoid this special case. Define xtensa_rt_sigreturn as accepting no
argiments. Use current_pt_regs to get pointer to struct pt_regs in
xtensa_rt_sigreturn. Don't pass additional parameter from system_call
code.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Don't overwrite return value if system call was cancelled at entry by
ptrace. Return status code from do_syscall_trace_enter so that
pt_regs::syscall doesn't need to be changed to skip syscall.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
system_call saves and restores syscall number across system call to make
clone and execv entry and exit tracing match. This complicates things
when syscall code may be changed by ptrace.
Preserve syscall code in copy_thread and start_thread directly instead of
doing tricks in system_call.
Signed-off-by: Max Filippov <jcmvbkbc@gmail.com>
Some versions of iproute2 will output more than one line per entry, which
will cause the test to fail, like:
TEST: ipv6: list and flush cached exceptions [FAIL]
can't list cached exceptions
That happens, for example, with iproute2 4.15.0. When using the -oneline
option, this will work just fine:
TEST: ipv6: list and flush cached exceptions [ OK ]
This also works just fine with a more recent version of iproute2, like
5.4.0.
For some reason, two lines are printed for the IPv4 test no matter what
version of iproute2 is used. Use the same -oneline parameter there instead
of counting the lines twice.
Fixes: b964641e99 ("selftests: pmtu: Make list_flush_ipv6_exception test more demanding")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe reports that current master fails building on powerpc with
this error:
CC fs/io_uring.o
fs/io_uring.c: In function ‘loop_rw_iter’:
fs/io_uring.c:1628:21: error: implicit declaration of function ‘kmap’
[-Werror=implicit-function-declaration]
iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1628:19: warning: assignment makes pointer from integer
without a cast [-Wint-conversion]
iovec.iov_base = kmap(iter->bvec->bv_page)
^
fs/io_uring.c:1643:4: error: implicit declaration of function ‘kunmap’
[-Werror=implicit-function-declaration]
kunmap(iter->bvec->bv_page);
^
which is caused by a missing highmem.h include. Fix it by including
it.
Fixes: 311ae9e159 ("io_uring: fix dead-hung for non-iter fixed rw")
Reported-by: Christophe Leroy <christophe.leroy@c-s.fr>
Tested-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit f2538f9993. The patch
stopped JFFS2 from being able to mount an existing filesystem with the
following errors:
jffs2: error: (77) jffs2_build_inode_fragtree: Add node to tree failed -22
jffs2: error: (77) jffs2_do_read_inode_internal: Failed to build final fragtree for inode #5377: error -22
Fixes: f2538f9993 ("jffs2: Fix possible null-pointer dereferences...")
Cc: stable@vger.kernel.org
Suggested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
Tung Nguyen says:
====================
tipc: Fix some bugs at socket layer
This series fixes some bugs at socket layer.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Scenario:
1. A client socket initiates a SYN message to a listening socket.
2. The send link is congested, the SYN message is put in the
send link and a wakeup message is put in wakeup queue.
3. The congestion situation is abated, the wakeup message is
pulled out of the wakeup queue. Function tipc_sk_push_backlog()
is called to send out delayed messages by Nagle. However,
the client socket is still in CONNECTING state. So, it sends
the SYN message in the socket write queue to the listening socket
again.
4. The listening socket receives the first SYN message and creates
first server socket. The client socket receives ACK- and establishes
a connection to the first server socket. The client socket closes
its connection with the first server socket.
5. The listening socket receives the second SYN message and creates
second server socket. The second server socket sends ACK- to the
client socket, but it has been closed. It results in connection
reset error when reading from the server socket in user space.
Solution: return from function tipc_sk_push_backlog() immediately
if there is pending SYN message in the socket write queue.
Fixes: c0bceb97db ("tipc: add smart nagle feature")
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In function __tipc_shutdown(), the timeout value passed to
tipc_wait_for_cond() is not jiffies.
This commit fixes it by converting that value from milliseconds
to jiffies.
Fixes: 365ad353c2 ("tipc: reduce risk of user starvation during link congestion")
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When tipc_sk_timeout() is executed but user space is grabbing
ownership, this function rearms itself and returns. However, the
socket reference counter is not reduced. This causes potential
unexpected behavior.
This commit fixes it by calling sock_put() before tipc_sk_timeout()
returns in the above-mentioned case.
Fixes: afe8792fec ("tipc: refactor function tipc_sk_timeout()")
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When initiating a connection message to a server side, the connection
message is cloned and added to the socket write queue. However, if the
cloning is failed, only the socket write queue is purged. It causes
memory leak because the original connection message is not freed.
This commit fixes it by purging the list of connection message when
it cannot be cloned.
Fixes: 6787927475 ("tipc: buffer overflow handling in listener socket")
Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This driver forgets to kill tasklet in remove.
Add the call to fix it.
Fixes: 032dc41ba6 ("net: macb: Handle HRESP error")
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski says:
====================
net: tls: fix scatter-gather list issues
This series kicked of by a syzbot report fixes three issues around
scatter gather handling in the TLS code. First patch fixes a use-
-after-free situation which may occur if record was freed on error.
This could have already happened in BPF paths, and patch 2 now makes
the same condition occur in non-BPF code.
Patch 2 fixes the problem spotted by syzbot. If encryption failed
we have to clean the end markings from scatter gather list. As
suggested by John the patch frees the record entirely and caller
may retry copying data from user space buffer again.
Third patch fixes a bug in the TLS 1.3 code spotted while working
on patch 2. TLS 1.3 may effectively overflow the SG list which
leads to the BUG() in sg_page() being triggered.
Patch 4 adds a test case which triggers this bug reliably.
Next two patches are small cleanups of dead code and code which
makes dangerous assumptions.
Last but not least two minor improvements to the sockmap tests.
Tested:
- bpf/test_sockmap
- net/tls
- syzbot repro (which used error injection, hence no direct
selftest is added to preserve it).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
perror(str) is basically equivalent to
print("%s: %s\n", str, strerror(errno)).
New line or colon at the end of str is
a mistake/breaks formatting.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>