Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-1859

Unexpected results from a projection after a union and join

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.8.0
    • 0.8.1
    • data
    • None
    • Mac OS 10.6.6
      java version "1.6.0_22"
      Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
      Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

    Description

      Posted this to the pig-users list, and another user indicated he had seen it too, so I thought I'd open a ticket. Adding as major because I can't workaround the projection issue even with numbered $n column names.

      I have a log file with events on pages, and the id of the event can be a users login or a users numeric id:

      2010-05-14,james
      2010-05-15,123
      2010-05-15,23
      2010-05-15,456
      2010-05-15,notjames

      So i want to join a set of users on either the login or user id.

      Here's my users:

      123,james,11
      234,notjames,11
      456,someoneelse,11

      So I thought I would be clever and load the user list, union it with itself to generate a relation where each user is represented twice, once by login, once by id:

      logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as actual_user_id;
      user_ids = FOREACH users GENERATE user_id as matching_id, user_id as actual_user_id;
      user_id_or_login_lookup = UNION logins, user_ids;

      user_id_or_login_lookup:

      {matching_id: chararray,actual_user_id: chararray}

      (123,123)
      (234,234)
      (456,456)
      (james,123)
      (notjames,234)
      (someoneelse,456)

      Then join on that, by the first column, and project that away, leaving just the event info and the numeric id.

      views_with_id = JOIN profile_views by viewed_user_id, user_id_or_login_lookup by matching_id;

      That is not working however. My joined relation looks like this (which is what I expect)

      views_with_id:

      {profile_views::date: chararray,profile_views::viewed_user_id: chararray,user_id_or_login_lookup::matching_id: chararray,user_id_or_login_lookup::actual_user_id: chararray}

      (2010-05-15,123,123,123)
      (2010-05-15,456,456,456)
      (2010-05-14,james,james,123)
      (2010-05-15,notjames,notjames,234)

      But when I project as follows: views_with_id_projected = FOREACH views_with_id GENERATE date, viewed_user_id, user_id_or_login_lookup::actual_user_id;

      The result is not what I expect

      (2010-05-15,123,123)
      (2010-05-15,456,456)
      (2010-05-14,james,james)
      (2010-05-15,notjames,notjames)

      To be clear, I expect

      (2010-05-15,123,123)
      (2010-05-15,456,456)
      (2010-05-14,james,123)
      (2010-05-15,notjames,456)

      Here's my full pig script:

      users = LOAD 'patients-test.txt' USING PigStorage(',') AS (user_id:chararray, login:chararray, disease_id: chararray);
      profile_views = LOAD 'patient-views-test.txt' USING PigStorage(',') AS(date: chararray, viewed_user_id:chararray);

      dump users;
      dump profile_views;

      – build a relation so that users are present to join by login or user_id
      logins = FOREACH users GENERATE LOWER(login) as matching_id, user_id as actual_user_id;
      user_ids = FOREACH users GENERATE user_id as matching_id, user_id as actual_user_id;
      user_id_or_login_lookup = UNION logins, user_ids;

      dump user_id_or_login_lookup;
      describe user_id_or_login_lookup;

      views_with_id = JOIN profile_views by viewed_user_id, user_id_or_login_lookup by matching_id;

      describe views_with_id;
      --STORE views_with_id into 'ep-views.txt';

      dump views_with_id;

      views_with_id_projected = FOREACH views_with_id GENERATE date, viewed_user_id, user_id_or_login_lookup::actual_user_id;

      dump views_with_id_projected;

      Attachments

        Activity

          People

            Unassigned Unassigned
            jkebinger James Kebinger
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: