  UIMA / UIMA-6487

Support Aggregate Engines in Apache UIMACPP

Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      UIMA is a framework for unstructured information management, built around the idea of heavyweight annotators that interoperate through a common exchange format.

      It has been in production use for about two decades.

      The framework is mostly written in Java. It has a C++ counterpart that implements a subset of the framework.

      The challenge for this GSoC project is to work with the mentor to implement the full framework.

      More details on GitHub: https://github.com/apache/uima-uimacpp/issues/6

      Benefits to the community

      Users of the C++ version of the framework have discussed this gap as one of the main roadblocks to using it: https://lists.apache.org/thread/f1r3sghgn2oqhvzz27y26zg6j3olv8qq

      From a broader perspective, there is the question of why we need NLP frameworks in 2024. The field has moved to approaches where source text is consumed in a destructive tokenization process that generates subtoken indices over a fixed vocabulary. These indices are then fed as input to a deep (transformer) neural network.

      When training such networks, particularly when building Large Language Models (LLMs), gargantuan amounts of text are quickly tokenized and fed into the model being trained. Additional computational effort at indexing time can help improve the data quality, privacy, and terms-of-use handling of the text. A high-performance UIMACPP can be the missing piece for quality input data to LLMs.

      Technical Skills

      Working on this problem requires intermediate knowledge of the C++ programming language.

      A solution will most probably exercise the following skills, which can be learned along the way, in parallel with the project (mentoring on these topics is not part of the project):

      • Linux command-line and build systems
      • XML parsing
      • Docker (image creation, deployment, debugging)

      About the mentor

      Dr. Duboue has more than 25 years of experience in AI. He has a Ph.D. in Computer Science from Columbia University and was a member of the IBM Watson team that beat the Jeopardy! champions.

      Aside from his consulting work, he has taught in three different countries and done joint research with more than fifty co-authors.

      He has years of experience mentoring both students and employees.

          People

            Assignee: Unassigned
            Reporter: Pablo Duboue (drdub)