Flink / FLINK-34201

Datadog name resolution fails and is not retried, causing metrics not to be exported


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.17.2
    • Fix Version/s: None
    • Component/s: Runtime / Metrics
    • Labels: None

    Description

      When a node restart happens on k8s, some deployments fail to report metrics to Datadog.

      At first, I thought it could be related to a timeout, so I added a cap of 500 metrics. But then I found this exception:

      java.lang.IllegalStateException: Failed contacting Datadog to validate API key
      	at org.apache.flink.metrics.datadog.DatadogHttpClient.validateApiKey(DatadogHttpClient.java:106) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at org.apache.flink.metrics.datadog.DatadogHttpClient.<init>(DatadogHttpClient.java:86) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at org.apache.flink.metrics.datadog.DatadogHttpReporter.<init>(DatadogHttpReporter.java:75) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at org.apache.flink.metrics.datadog.DatadogHttpReporterFactory.createMetricReporter(DatadogHttpReporterFactory.java:59) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.metrics.ReporterSetup.loadViaFactory(ReporterSetup.java:418) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.metrics.ReporterSetup.loadViaFactory(ReporterSetup.java:408) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.metrics.ReporterSetup.loadReporter(ReporterSetup.java:372) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.metrics.ReporterSetup.setupReporters(ReporterSetup.java:326) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.metrics.ReporterSetup.fromConfiguration(ReporterSetup.java:207) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManagerRunnerServices(TaskManagerRunner.java:224) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.start(TaskManagerRunner.java:293) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:486) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$5(TaskManagerRunner.java:530) ~[flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:530) [flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:510) [flink-dist-1.17.2.jar:1.17.2]
      	at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:468) [flink-dist-1.17.2.jar:1.17.2]
      Caused by: java.net.UnknownHostException: app.datadoghq.com: Temporary failure in name resolution
      	at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) ~[?:?]
      	at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(Unknown Source) ~[?:?]
      	at java.net.InetAddress.getAddressesFromNameService(Unknown Source) ~[?:?]
      	at java.net.InetAddress$NameServiceAddresses.get(Unknown Source) ~[?:?]
      	at java.net.InetAddress.getAllByName0(Unknown Source) ~[?:?]
      	at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
      	at java.net.InetAddress.getAllByName(Unknown Source) ~[?:?]
      	at okhttp3.Dns.lambda$static$0(Dns.java:39) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.RouteSelector.resetNextInetSocketAddress(RouteSelector.java:171) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.RouteSelector.nextProxy(RouteSelector.java:135) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.RouteSelector.next(RouteSelector.java:84) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.ExchangeFinder.findConnection(ExchangeFinder.java:187) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.java:108) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.java:88) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.Transmitter.newExchange(Transmitter.java:169) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:41) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:94) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:88) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at okhttp3.RealCall.execute(RealCall.java:81) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      	at org.apache.flink.metrics.datadog.DatadogHttpClient.validateApiKey(DatadogHttpClient.java:101) ~[flink-metrics-datadog-1.17.2.jar:1.17.2]
      

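      There is no retry mechanism here. validateApiKey() is called from the DatadogHttpClient constructor and makes a single attempt, so a temporary DNS failure at startup aborts the creation of the reporter: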

          private void validateApiKey() {
              Request r = new Request.Builder().url(validateUrl).get().build();
      
              try (Response response = client.newCall(r).execute()) {
                  if (!response.isSuccessful()) {
                      throw new IllegalArgumentException(String.format("API key: %s is invalid", apiKey));
                  }
              } catch (IOException e) {
                  throw new IllegalStateException("Failed contacting Datadog to validate API key", e);
              }
          }
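
      A possible direction, sketched under assumptions: wrap the validation call in a bounded retry with exponential backoff, so that a temporary name resolution failure does not permanently disable the reporter. The class IoRetry, the interface IoAction, and the parameters maxAttempts and initialBackoffMillis are hypothetical names for illustration only; they are not part of Flink or OkHttp, and this is not a proposed patch.

```java
import java.io.IOException;
import java.net.UnknownHostException;

/** Hypothetical helper (not part of Flink): retries an I/O action with exponential backoff. */
final class IoRetry {

    /** An action that may fail with a transient I/O error, e.g. a failed DNS lookup. */
    @FunctionalInterface
    interface IoAction {
        void run() throws IOException;
    }

    private IoRetry() {}

    /**
     * Runs the action until it succeeds or maxAttempts is reached.
     * Only IOException is retried; any RuntimeException propagates immediately.
     *
     * @return the 1-based number of the attempt that succeeded
     * @throws IOException the last failure, if every attempt failed
     */
    static int runWithRetry(IoAction action, int maxAttempts, long initialBackoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // wait before the next attempt
                    backoff *= 2;          // double the delay each time
                }
            }
        }
        throw last;
    }

    /** Demo: simulates two DNS failures followed by a successful call. */
    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        int attempts = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("app.datadoghq.com");
            }
        }, 5, 10L);
        System.out.println("succeeded on attempt " + attempts);
    }
}
```

      In validateApiKey(), the HTTP call could be passed to such a helper as the action. The invalid-key branch throws IllegalArgumentException, which is not an IOException, so a rejected API key would still fail immediately rather than being retried.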
      

          People

            Assignee: Unassigned
            Reporter: Pedro Mázala (pedromazala)
            Votes: 0
            Watchers: 1
