IMPALA / IMPALA-3093

ReopenClient() could NULL out 'client_key' causing a crash

    Description

      While running the stress tests with a custom patch for IMPALA-2592, I'm hitting a crash in DoRpc() with the following stack:

      Stack: [0x00007fab64253000,0x00007fab64c54000],  sp=0x00007fab64c51f90,  free space=10235k
      Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
      C  [impalad+0x10163c4]  impala::Status impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::* const&)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+0x108
      C  [impalad+0x1015596]  impala::FragmentMgr::FragmentExecState::ReportStatusCb(impala::Status const&, impala::RuntimeProfile*, bool)+0x598
      C  [impalad+0x100fe97]  boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>::operator()(impala::FragmentMgr::FragmentExecState*, impala::Status const&, impala::RuntimeProfile*, bool) const+0x7d
      C  [impalad+0x100f65a]  void boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> >::operator()<boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list3<impala::Status const&, impala::RuntimeProfile*&, bool&> >(boost::_bi::type<void>, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>&, boost::_bi::list3<impala::Status const&, impala::RuntimeProfile*&, bool&>&, int)+0xa8
      C  [impalad+0x100efed]  void boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> > >::operator()<impala::Status const, impala::RuntimeProfile*, bool>(impala::Status const&, impala::RuntimeProfile*&, bool&)+0x53
      C  [impalad+0x100eacd]  boost::detail::function::void_function_obj_invoker3<boost::_bi::bind_t<void, boost::_mfi::mf3<void, impala::FragmentMgr::FragmentExecState, impala::Status const&, impala::RuntimeProfile*, bool>, boost::_bi::list4<boost::_bi::value<impala::FragmentMgr::FragmentExecState*>, boost::arg<1>, boost::arg<2>, boost::arg<3> > >, void, impala::Status const&, impala::RuntimeProfile*, bool>::invoke(boost::detail::function::function_buffer&, impala::Status const&, impala::RuntimeProfile*, bool)+0x39
      C  [impalad+0x13fa176]  boost::function3<void, impala::Status const&, impala::RuntimeProfile*, bool>::operator()(impala::Status const&, impala::RuntimeProfile*, bool) const+0x68
      C  [impalad+0x13f7e55]  impala::PlanFragmentExecutor::SendReport(bool)+0x10b
      C  [impalad+0x13f7aef]  impala::PlanFragmentExecutor::ReportProfile()+0x6bf
      C  [impalad+0x13fbc4b]  boost::_mfi::mf0<void, impala::PlanFragmentExecutor>::operator()(impala::PlanFragmentExecutor*) const+0x65
      C  [impalad+0x13fb992]  void boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> >::operator()<boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>&, boost::_bi::list0&, int)+0x4a
      C  [impalad+0x13fb5f7]  boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> > >::operator()()+0x3b
      C  [impalad+0x13fb3bc]  boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, impala::PlanFragmentExecutor>, boost::_bi::list1<boost::_bi::value<impala::PlanFragmentExecutor*> > >, void>::invoke(boost::detail::function::function_buffer&)+0x20
      C  [impalad+0xe1fc76]  boost::function0<void>::operator()() const+0x52
      C  [impalad+0x10cd6b9]  impala::Thread::SuperviseThread(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*)+0x2c5
      C  [impalad+0x10d4d34]  void boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> >::operator()<void (*)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list0>(boost::_bi::type<void>, void (*&)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list0&, int)+0xb2
      C  [impalad+0x10d4c77]  boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> > >::operator()()+0x3b
      C  [impalad+0x10d4c3a]  boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(std::string const&, std::string const&, boost::function<void ()()>, impala::Promise<long>*), boost::_bi::list4<boost::_bi::value<std::string>, boost::_bi::value<std::string>, boost::_bi::value<boost::function<void ()()> >, boost::_bi::value<impala::Promise<long>*> > > >::run()+0x1e
      

      Looking at the disassembly, it fails here:

      Dump of assembler code for function impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*):
         0x00000000014162bc <+0>:	push   %rbp
         0x00000000014162bd <+1>:	mov    %rsp,%rbp
         0x00000000014162c0 <+4>:	push   %r13
         0x00000000014162c2 <+6>:	push   %r12
         0x00000000014162c4 <+8>:	push   %rbx
         0x00000000014162c5 <+9>:	sub    $0xf8,%rsp
         0x00000000014162cc <+16>:	mov    %rdi,-0xe8(%rbp)
         0x00000000014162d3 <+23>:	mov    %rsi,-0xf0(%rbp)
         0x00000000014162da <+30>:	mov    %rdx,-0xf8(%rbp)
         0x00000000014162e1 <+37>:	mov    %rcx,-0x100(%rbp)
         0x00000000014162e8 <+44>:	mov    %r8,-0x108(%rbp)
         0x00000000014162ef <+51>:	cmpq   $0x0,-0x108(%rbp)
         0x00000000014162f7 <+59>:	sete   %al
         0x00000000014162fa <+62>:	movzbl %al,%eax
         0x00000000014162fd <+65>:	mov    $0x0,%ebx
         0x0000000001416302 <+70>:	mov    $0x0,%r12d
         0x0000000001416308 <+76>:	test   %rax,%rax
         0x000000000141630b <+79>:	je     0x1416375 <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+185>
         0x000000000141630d <+81>:	lea    -0xe0(%rbp),%rax
         0x0000000001416314 <+88>:	mov    $0xe3,%edx
         0x0000000001416319 <+93>:	lea    0xf36628(%rip),%rsi        # 0x234c948
         0x0000000001416320 <+100>:	mov    %rax,%rdi
         0x0000000001416323 <+103>:	callq  0x223bce0 <_ZN6google15LogMessageFatalC2EPKci>
         0x0000000001416328 <+108>:	mov    $0x1,%ebx
         0x000000000141632d <+113>:	lea    -0xe0(%rbp),%rax
         0x0000000001416334 <+120>:	mov    %rax,%rdi
         0x0000000001416337 <+123>:	callq  0x106eab6 <google::LogMessage::stream()>
         0x000000000141633c <+128>:	lea    0xf36695(%rip),%rsi        # 0x234c9d8
         0x0000000001416343 <+135>:	mov    %rax,%rdi
         0x0000000001416346 <+138>:	callq  0x1018790 <_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc@plt>
         0x000000000141634b <+143>:	mov    %rax,%r13
         0x000000000141634e <+146>:	lea    -0xc1(%rbp),%rax
         0x0000000001416355 <+153>:	mov    %rax,%rdi
         0x0000000001416358 <+156>:	callq  0x106eacc <google::LogMessageVoidify::LogMessageVoidify()>
         0x000000000141635d <+161>:	mov    $0x1,%r12d
         0x0000000001416363 <+167>:	lea    -0xc1(%rbp),%rax
         0x000000000141636a <+174>:	mov    %r13,%rsi
         0x000000000141636d <+177>:	mov    %rax,%rdi
         0x0000000001416370 <+180>:	callq  0x106ead6 <google::LogMessageVoidify::operator&(std::ostream&)>
         0x0000000001416375 <+185>:	test   %r12b,%r12b
         0x0000000001416378 <+188>:	test   %bl,%bl
         0x000000000141637a <+190>:	je     0x141638c <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+208>
         0x000000000141637c <+192>:	nop
         0x000000000141637d <+193>:	lea    -0xe0(%rbp),%rax
         0x0000000001416384 <+200>:	mov    %rax,%rdi
         0x0000000001416387 <+203>:	callq  0x223bd00 <_ZN6google15LogMessageFatalD2Ev>
         0x000000000141638c <+208>:	nop
         0x000000000141638d <+209>:	mov    -0xf8(%rbp),%rax
         0x0000000001416394 <+216>:	mov    (%rax),%rax
         0x0000000001416397 <+219>:	and    $0x1,%eax
         0x000000000141639a <+222>:	test   %rax,%rax
         0x000000000141639d <+225>:	jne    0x14163ab <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+239>
         0x000000000141639f <+227>:	mov    -0xf8(%rbp),%rax
         0x00000000014163a6 <+234>:	mov    (%rax),%rax
         0x00000000014163a9 <+237>:	jmp    0x14163db <impala::ClientConnection<impala::ImpalaInternalServiceClient>::DoRpc<void (impala::ImpalaInternalServiceClient::*)(impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams, impala::TReportExecStatusResult>(void (impala::ImpalaInternalServiceClient::*&)(impala::ImpalaInternalServiceClient*, impala::TReportExecStatusResult&, impala::TReportExecStatusParams const&), impala::TReportExecStatusParams const&, impala::TReportExecStatusResult*)+287>
         0x00000000014163ab <+239>:	mov    -0xf0(%rbp),%rax
         0x00000000014163b2 <+246>:	mov    0x8(%rax),%rdx
         0x00000000014163b6 <+250>:	mov    -0xf8(%rbp),%rax
         0x00000000014163bd <+257>:	mov    0x8(%rax),%rax
         0x00000000014163c1 <+261>:	add    %rdx,%rax
      => 0x00000000014163c4 <+264>:	mov    (%rax),%rdx
         0x00000000014163c7 <+267>:	mov    -0xf8(%rbp),%rax
         0x00000000014163ce <+274>:	mov    (%rax),%rax
         0x00000000014163d1 <+277>:	sub    $0x1,%rax
         0x00000000014163d5 <+281>:	add    %rdx,%rax
         0x00000000014163d8 <+284>:	mov    (%rax),%rax
         0x00000000014163db <+287>:	mov    -0xf0(%rbp),%rdx
         0x00000000014163e2 <+294>:	mov    0x8(%rdx),%rcx
         0x00000000014163e6 <+298>:	mov    -0xf8(%rbp),%rdx
         0x00000000014163ed <+305>:	mov    0x8(%rdx),%rdx
         0x00000000014163f1 <+309>:	lea    (%rcx,%rdx,1),%rdi
         0x00000000014163f5 <+313>:	mov    -0x100(%rbp),%rdx
         0x00000000014163fc <+320>:	mov    -0x108(%rbp),%rcx
         0x0000000001416403 <+327>:	mov    %rcx,%rsi
         0x0000000001416406 <+330>:	callq  *%rax
         0x0000000001416408 <+332>:	mov    -0xe8(%rbp),%rax
         0x000000000141640f <+339>:	mov    %rax,%rdi
         0x0000000001416412 <+342>:	callq  0x108027f <impala::Status::OK()>
      

      At that point, 'client_' (a member of the class ClientConnection) should be in $rdx, but $rdx is NULL, which causes the crash. This is odd, and there is no reasonable explanation yet for why it happens.
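
      To make the faulting sequence easier to follow, here is a minimal sketch of how a call through a pointer to member function is resolved under the Itanium C++ ABI, which GCC uses on x86-64. It is a model, not Impala code: MemFnPtr, FakeClient, GetValue and Resolve are hypothetical names, and the mapping to the offsets in the disassembly above is my reading of it, assuming 'client_' lives at offset 0x8 of the ClientConnection. The only intentional difference from the compiled code is the null check, which the real sequence does not have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Model of a pointer to member function in the Itanium C++ ABI: 'ptr' is a
// plain function address or, for a virtual function, 1 plus the byte offset
// of its vtable slot; 'adj' is the adjustment added to 'this'.
struct MemFnPtr {
  std::intptr_t ptr;
  std::ptrdiff_t adj;
};

using Fn = int (*)(void* self);

// Hypothetical stand-in for a client object: its first word is a vtable pointer.
struct FakeClient {
  const Fn* vtable;
  int value;
};

int GetValue(void* self) { return static_cast<FakeClient*>(self)->value; }

// Resolves the call target in the same order as <+209>..<+284>, but returns
// false for a null object instead of faulting on the vtable load.
bool Resolve(FakeClient* client, const MemFnPtr& f, Fn* out) {
  assert(out != nullptr);
  if ((f.ptr & 1) == 0) {  // <+219>: low bit clear, 'ptr' is the address itself
    *out = reinterpret_cast<Fn>(f.ptr);
    return true;
  }
  if (client == nullptr) return false;  // the compiled code has no such check
  // <+261>: add the 'this' adjustment to the object pointer.
  const char* self = reinterpret_cast<const char*>(client) + f.adj;
  // <+264>: load the vtable pointer from the object; this is the faulting load.
  const char* vtable = nullptr;
  std::memcpy(&vtable, self, sizeof(vtable));
  // <+277>..<+284>: load the function address from slot (ptr - 1).
  std::memcpy(out, vtable + (f.ptr - 1), sizeof(*out));
  return true;
}
```

      With a valid object, Resolve() yields the function stored in the selected vtable slot. With a null object and a virtual target, the real sequence dereferences an address at or near zero, which matches a crash at <+264>.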

      The crash does not occur immediately; it happens only after remote nodes become unreachable, which under the conditions of the following run takes about 2 hours:
      http://sandbox.jenkins.cloudera.com/view/Impala/view/Stress/job/Impala-Stress-Test-EC2-CDH5-trunk/514/parameters/
      So far, the crash does not appear to be caused by the patch for IMPALA-2592 itself; the patch seems to expose an existing bug. I'm still digging into the root cause and will update once I have more information.

      The crash does not show up without the patch. The patch is at: http://gerrit.cloudera.org:8080/#/c/2205/7
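
      The failure mode named in the title can be sketched as follows. This is a hypothetical reduction, not the Impala client cache: Status, StubClient, StubClientCache and StubConnection are stand-ins, and the assumption that reopening destroys the old client before a failed reconnect is mine. It shows how a reopen helper that takes a pointer to the caller's key can leave that key NULL when the remote node is unreachable, and how a check before the RPC turns the crash into an error status.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical minimal status type; not impala::Status.
struct Status {
  bool ok;
  std::string msg;
};

struct StubClient { int id; };
using ClientKey = StubClient*;

class StubClientCache {
 public:
  bool remote_reachable = true;  // flip to false to simulate an unreachable node

  // Creates a client and stores its key in *key; leaves *key NULL on failure.
  Status CreateClient(ClientKey* key) {
    assert(key != nullptr);
    *key = nullptr;
    if (!remote_reachable) return {false, "connect failed"};
    std::unique_ptr<StubClient> client(new StubClient{next_id_++});
    *key = client.get();
    clients_[*key] = std::move(client);
    return {true, ""};
  }

  // Assumed hazard: the old client is destroyed first, so a failed reconnect
  // leaves the caller's key NULL rather than pointing at a usable client.
  Status ReopenClient(ClientKey* key) {
    clients_.erase(*key);
    return CreateClient(key);
  }

 private:
  int next_id_ = 1;
  std::unordered_map<ClientKey, std::unique_ptr<StubClient>> clients_;
};

class StubConnection {
 public:
  explicit StubConnection(StubClientCache* cache) : cache_(cache) {}
  Status Open() { return cache_->CreateClient(&client_); }
  Status Reopen() { return cache_->ReopenClient(&client_); }

  // Guarded RPC: returns an error instead of dereferencing a NULL client.
  Status DoRpc(int* out) {
    assert(out != nullptr);
    if (client_ == nullptr) return {false, "client is not connected"};
    *out = client_->id;
    return {true, ""};
  }

 private:
  StubClientCache* cache_;
  ClientKey client_ = nullptr;
};
```

      In this sketch, a caller that ignores the status returned by Reopen() and calls an unguarded DoRpc() would dereference NULL, which is the shape of the crash described above.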

          People

            Sailesh Mukil (sailesh)