Beam / BEAM-8423

Japanese characters encoding issue

Details

    • Type: Bug
    • Status: Open
    • Priority: P3
    • Resolution: Unresolved
    • Affects Version/s: 2.15.0
    • Fix Version/s: None
    • Component/s: runner-dataflow
    • None
    • dataflow

    Description

      I am running an Apache Beam job to parse Japanese HTML pages. While the job runs, the Stackdriver logs show the Japanese characters correctly. However, the same data written to the GCS bucket has an encoding issue and ends up corrupted.

       

      //code
      
      
      Pipeline pipeline = Pipeline.create(options);
      CoderRegistry cr = pipeline.getCoderRegistry();
      cr.registerCoderForClass(String.class, StringUtf8Coder.of());
      cr.registerCoderForClass(Integer.class, BigEndianIntegerCoder.of());
      
      batchTuple = pipeline
          .apply("Read from input files",
              TextIO.read()
                  .from(options.getloadingBucketURL())
                  .withCompression(Compression.GZIP))
          .setCoder(StringUtf8Coder.of())
          .apply("Process input files",
              ParDo.of(new ExtractDataFromHtmlPage(extractionConfig, beamConfig.getLoadingBucketURL()))
                  .withOutputTags(successRecord, TupleTagList.of(errorRecord).and(deadLetterRecords)));
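
      For reference, the symptom described above (text that looks correct in one place and corrupted in another) is typical of bytes being written with one charset and read back with a different one. The following is a minimal, JDK-only sketch of that effect. It is not taken from the reporter's job and does not claim to identify the cause of this issue; the class name CharsetRoundTrip and the helper reencode are hypothetical names chosen for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Beam or of the reporter's code.
class CharsetRoundTrip {

    // Encode the text to bytes with one charset, then decode those bytes with another.
    static String reencode(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // The three characters of "Japanese language", written as Unicode escapes
        // so the example does not depend on the source file's own encoding.
        String sample = "\u65E5\u672C\u8A9E";

        // Same charset on both sides: the text survives unchanged.
        String intact = reencode(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("UTF-8 -> UTF-8 intact: " + intact.equals(sample));

        // UTF-8 bytes decoded as ISO-8859-1: each byte becomes its own character,
        // so the three original characters turn into nine unrelated ones.
        String garbled = reencode(sample, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 -> ISO-8859-1 intact: " + garbled.equals(sample));
        System.out.println("Garbled length: " + garbled.length());

        // ISO-8859-1 maps every byte value to a character, so this particular
        // mismatch can be reversed by undoing the wrong decode.
        String recovered = reencode(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("Recovered: " + recovered.equals(sample));
    }
}
```

      When checking a case like this, it can help to compare the raw bytes of the output object with the charset assumed by whatever tool displays them, and to pass an explicit Charset wherever user code converts between String and byte[] rather than relying on the platform default.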

       

       

       


          People

            Assignee: Unassigned
            Reporter: Jyoti Aditya (jyotiaditya)
            Votes: 1
            Watchers: 4
