Created attachment 112270 [details] Attachments zipped: spreadsheet to reproduce, SumIfs OpenCL kernel code, the log. The issue can be identified as an endless run with PCMark8, when selecting Spreadsheet in Work test, using acceleration mode. After a bit of investigation, we have identify the issue seems like incorrect set/usages of parameters that result out-of-boundaries in OpenCL kernel execution. And it seems located somewhere around https://gitorious.org/libreoffice/core/source/7d68e43e8b35419dc481ce3a06716113b4045839:sc/source/core/opencl/op_math.cxx# Line 442 void OpSumIfs::GenSlidingWindowFunction(std::stringstream &ss, const std::string sSymName, SubArguments &vSubArguments). More information: Here is an example of a buffer allocated with wrong size. DEBUG 13:58:16 4032 c:\work\aderi\vcp_olc-opencl-sdk\src\framework\Context\context_module.cpp (1635) Intel::OpenCL::Framework::ContextModule::CreateBuffer CreateBuffer return handle 120115056 clCreateBuffer(0x2db9228, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, 8, 0x153de490, CL_SUCCESS) = 0x728cf70 DEBUG 13:58:16 4032 c:\work\aderi\vcp_olc-opencl-sdk\src\framework\Context\kernel.cpp (797) Intel::OpenCL::Framework::Kernel::SetKernelArg SetKernelArg buffer cl_mem=120115056 clSetKernelArg(0x7234350, 4, 4, 0x1383e10) = CL_SUCCESS DEBUG 13:58:16 4032 c:\work\aderi\vcp_olc-opencl-sdk\src\devices\cpu_device\cpu_device.cpp (2019) Intel::OpenCL::CPUDevice::CPUDevice::clDevFlushCommandList clDevFlushCommandList Function enter clEnqueueNDRangeKernel(0x2dd5768, 0x7234350, 2, nullptr, { 256, 100 }, { 256, 1 }, 0, nullptr, nullptr) = CL_SUCCESS The buffer allocated for 5th argument (argument number 4) of kernel tmp0_0_0_SumIfs_reduction was created with size 8 bytes. The kernel was executed with global size = {256, 100} and local size {256, 1} => group size = {1, 100} The kernel tmp0_0_0_SumIfs_reduction access 5th argument __global double *tmp0_0_4 as follows: tmp4= fsum(tmp0_0_4[get_group_id(1)], 0); Summary by Tor: - The version you are using is the tag libreoffice-4-2-milestone-5, in the branch libreoffice-4-2. - The call to clCreateBuffer() that your log shows is most likely the one in ParallelReductionVectorRef::Marshal() and the clSetKernelArg() call is the one in DynamicKernelSoPArguments::Marshal(), the one in the for (size_t j=0; j< vclmem.size(); j++) loop, and it is the ‘redKernel’ kernel that has the name “tmp0_0_0_SumIfs_reduction”. This all is in the sc/source/core/opencl/formulagroupcl.cxx source file. - The problematic OpenCL code line tmp4= fsum(tmp0_0_4[get_group_id(1)], 0) is generated in CheckVariables::CheckSubArgumentIsNan2() in sc/source/core/opencl/opbase.cxx. The issue is reproducible sporadically. We experience failures while running PCMark. One can trigger SumIfs calculation using spreadsheet in the attachment, if OpenCL acceleration is turned on and maybe even forced. The archive attached contains: - spreadsheet that uses SumIfs. SumIfs are located on the sheet named 'Main', starting S3. - SumIfs kernel source code, that is generated while calculating the formula. - Log file of OpenCL calls.
The cl_284_fail.log file is rather massive, can you give any hints what to look for in it to find information relevant to the problem? Also, would it be possible to reduce the sample spreadsheet even further? The ideal would be to have just one column of the problematic SUMIFS formulas. Would make following the output when I add debugging printouts to the OpenCL code and to the LO C++ code much easier.
Do I understand correctly, that you are noticing the array out-of-bound references only sporadically, when the access happens to cross a page boundary and point into a nonexistent page (or a read-protected page)? Even if the out-of-bounds accesses happen presumably regularly, but as long as the result is an access to an existing page, they aren't trapped? Is there perhaps some "debug mode" in your OpenCL implementation, that would cause it to keep track of the size of each buffer created with clCreateBuffer, and then when a kernel has code that dereferenes a pointer to such a buffer (as in the tmp0_0_4[get_group_id(1)]), it would check that the index does not cause out-of-bound access? Kind of like a dynamic memory checker or memory debugger for OpenCL. (Like Valgrind, Purify etc for host platform code.)
I added some debugging printout to the code to log the OpenCL buffers created, the host buffer address in cases where CL_MEM_USE_HOST_PTR is used, and to the generated OpenCL for the problematic lines, the one where indexing a buffer with get_group_id(1). I could not see any out-of-bound access in the resulting output. Attaching the diff and the output.
Created attachment 112280 [details] Patch to add debugging output to the relevant C++ and OpenCL code
Created attachment 112281 [details] Debug output from loading the attached spreadsheet Output from running SAL_LOG=+WARN.sc.opencl+INFO.opencl+INFO.sc.opencl instdir/program/soffice /home/tml/Downloads/spreadsheet.ods
Yes, we notice out-of-bounds array references sporadically, but I'm not sure that it is connected with the sheets. I agree, the spreadsheet I attached is not the reproducer. Currently we are working on it. For now my suspicion is that there is a subtle sequence of actions produced by an AutoIt script with Calc that causes the issue. Meantime, please, check that we are working on the save revision: Version: 4.2.0.0.beta2+ Build ID: da443ab58158d2b7ffa52742cec2be76e3aa2026.
I meant on the same revision, of course.
Yes, that is the revision I have in the tree where I am looking at this.
Created attachment 112341 [details] Suggested patch
Could you please share with us LibreOffice build that has this patch? Unfortunately we do not have the environment set up to build it.
Created attachment 112453 [details] Replacement scopencllo.dll
Created attachment 112479 [details] Log with dumping of buffer sizes and handles
By the way, I've studied our log (attached cl_3880.log) more closely, and that's what I found: On line 283 we can see tmp0_0_0_SumIfs_reduction kernel is being queried for arg info. SetKernelArg (uiIndex=4) on line 297, which is 5th argument (i.e. tmp0_0_4), is being initialized with buffer handle 363061576. Above in the line 194 this handle can be seen returned from the CreateBuffer (line 185). The debug output line 168 shows it was created with size 8. Just a bit below (line 208) the same buffer with handle 363061576 is being set as arg5 of DynamicKernel_nop_SumIfs kernel. In other words, the buffer is created with size 8 for one kernel - DynamicKernel_nop_SumIfs (but actually tmp0_0_4 is not used there). And then reused in tmp0_0_0_SumIfs_reduction without re-allocating. All kernels seems to be enqeued with { 256, 100 }, { 256, 1 }. May be this will give you guys a hint. Please correct me in case I'm mistaken.
Yes; isn't that was described already from the start, that a buffer is allocated with size 8 (one double) but then accessed way beyond that?
Yes, the new information is the buffer is used in 2 kernels.
But that is not a problem, is it?
No, this is an observation, that might give you a hint regarding the root cause.
We experience issues with drop-in replacement of the dll. We observe crashes or hanging of the Calc. Does patched Calc works correctly with/without OpenCL on your side?
Created attachment 113075 [details] Spreadsheet to reproduce
I prepared a reproducer spreadsheet confirmed by trace log of OpenCL calls. The spreadsheet and the log are attached. SumIfs are located in Main.S3:S102 in the single column as requested. Hope this will help you to root cause the issue. 1. Open destination_workbook.xlsx 2. Force Calc to compute formulas using OpenCL 3. Press Ctrl-Shift-F9 to recalculate all the formulas.
Created attachment 113076 [details] Reproducer trace log of OpenCL calls
When you say "Spreadsheet to reproduce", do you mean that you see the crash with that document? Is that with or without the suggested patch?
Created attachment 113306 [details] Patch to add even more debugging output
I don't see that the reproduction document from comment #19 even causes the OpenCL code in tmp0_0_0_SumIfs_reduction() to enter the code paths where the patch from comment #9 adds additional tests to avoid buffer overflow.
The spreadsheet in comment #19 is intended to demonstrate the creation of the buffer with wrong size while global id ranges from 0 to 99. I used LibreOffice without any patches. I suppose OpenCL code may not reach the crash point.
For the 4.2 branch, our priority is to fix the crash. The OpenCL-generating code is much different in the 4.3, 4.4 and master branches.
Thanks, Tor. I've been looking into 4.2 milestone-7-2, and I have a suspicion that the issue is fixed there. I will perform additional validations to be sure and update this ticket ASAP. I guess 4.2 milestone-7-2 doesn't differ significantly comparing to the revision in the ticket, is that so?
The changes between libreoffice-4-2-milestone-5 and libreoffice-4-2-milestone-7 can be seen with a simple git command, see below. Especially the "GPU-Calc: Change the szHostBuffer to fix the out of boundary bug" change seems quite relevant to this bug. $ git log --stat libreoffice-4-2-milestone-5..libreoffice-4-2-milestone-7 commit f6fb7359657425acfb045a6daeaad20ca70a6cb8 (tag: libreoffice-4-2-milestone-7) Author: Wei Wei <weiwei@multicorewareinc.com> Date: Fri Feb 28 15:07:36 2014 -0600 GPU-Calc: remove Alloc_Host_Ptr for clmem of NAN vector Change-Id: I94841f3e4df30265f609b1405453f18f251e6beb sc/source/core/opencl/formulagroupcl.cxx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) commit e83eab460b7f31265b209c4bfff7eb4d235e9a77 (tag: libreoffice-4-2-milestone-6) Author: Wei Wei <weiwei@multicorewareinc.com> Date: Fri Feb 28 11:57:56 2014 -0600 GPU-Calc: remove the flag of USE_HOST_PTR and ALLOC_HOST_PTR Change-Id: I000688c93b0feeb6da213b0f1f307a5062a41504 sc/source/core/opencl/formulagroupcl.cxx | 47 ++++++++++++++++++-------------- 1 file changed, 27 insertions(+), 20 deletions(-) commit 4f0d08df0f48c1c47f11e6f5206f445823deca1e Author: Wei Wei <weiwei@multicorewareinc.com> Date: Fri Feb 28 11:25:33 2014 -0600 GPU-Calc: Change the szHostBuffer to fix the out of boundary bug Change-Id: I4037fb3c0ddda1a9c4b0e777e1a75bc0b11fd24b sc/source/core/opencl/formulagroupcl.cxx | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) commit d1fc65581906ef0ce4c8e276ae67a35fb0fc6dfd Author: Kohei Yoshida <kohei.yoshida@collabora.com> Date: Wed Feb 26 14:32:57 2014 -0500 Store the length of originally requested array size prior to trimming. This change adds GetRequestedArrayLength() method to both single and double vector ref tokens, which returns the length of the requested array size prior to trimming of the trailing empty cell region. Change-Id: Iaba96fa2ea4ff3c8bccb0bc86fa4f1525e2f45fb formula/source/core/api/vectortoken.cxx | 29 +++++++++++++++++++++-------- include/formula/vectortoken.hxx | 13 +++++++++---- sc/source/core/data/grouptokenconverter.cxx | 12 ++++++------ 3 files changed, 36 insertions(+), 18 deletions(-) commit 7da51df558c1fbc21188a0c58773ba549195ca5a Author: Kohei Yoshida <kohei.yoshida@collabora.com> Date: Wed Feb 26 16:29:27 2014 -0500 Ensure that numeric array storage is aligned to 256-byte boundary. OpenCL devices require this else we would get a performance hit. Change-Id: I6b1db6320fa84f933b6446022a0fd02ba267bf21 sc/inc/formulagroup.hxx | 3 +- sc/inc/stlalgorithm.hxx | 91 +++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 93 insertions(+), 1 deletion(-)
I confirm that the issue is fixed in libreoffice-4-2-milestone-7-2. Most likely it's the commit commit 4f0d08df0f48c1c47f11e6f5206f445823deca1e Author: Wei Wei <weiwei@multicorewareinc.com> Date: Fri Feb 28 11:25:33 2014 -0600 GPU-Calc: Change the szHostBuffer to fix the out of boundary bug Change-Id: I4037fb3c0ddda1a9c4b0e777e1a75bc0b11fd24b sc/source/core/opencl/formulagroupcl.cxx | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)