The Tribe AI community recently organized its first-ever LLM Hackathon, bringing together over 60 registered participants, along with an incredible lineup of sponsors and judges. Our objective was clear: as a collective of AI technologists, researchers, and product leaders, we aimed to discover novel applications and infrastructure for LLMs in a collaborative and supportive environment.
As cutting-edge advances push the boundaries of what LLMs can do, it felt crucial for us to stay abreast of these developments through hands-on experience. We focused on the most pressing questions and hottest trends in AI:
- What are the emergent behaviors of LLMs?
- To what extent do LLMs still struggle with negation, and how can we prevent them from generating unreliable or fabricated information?
- In what situations can smaller, custom-trained models outperform larger, general-purpose language models? What strategies can we use to parallelize tasks and improve response times?
- How can user interfaces and tools improve the steerability and customization of chatbot experiences?
- What new approaches or techniques are required for effective asynchronous scientific computing?
- What does it take to train and run one’s own LLM, and how do the high memory requirements and GPU dependency of these models pose challenges? How can we mitigate or work around these challenges?
The answers to these questions were critical for navigating the LLM Hackathon successfully.
What did we learn?
During the week-long event, participants attended informal sessions to share learnings, challenges, obstacles, and feedback. Here are a few key takeaways that emerged during the hackathon.
Theme #1: Performance and Efficiency of LLMs
Key takeaway: Smaller, tailored models proved more effective than general-purpose LLMs many times their size
- LLMs tend to hallucinate - which is fine if you want to draw an image of a cowboy panda on the moon. However, what is the impact of hallucination on more precision-dependent tasks?
- For instance, if you’re trying to calculate how much revenue was generated in Q3 of 2023 (as the Soleda project did), an exact number is crucial. Surprisingly, we discovered that training a smaller model on high-quality data can outperform an LLM that is 100x larger (see the sketch after this list).
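To make that idea concrete, here is a minimal, hypothetical sketch of fine-tuning a small seq2seq model on curated question-answer pairs with Hugging Face transformers. The model name, data, and hyperparameters are placeholders, not the Soleda team's actual setup.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Hypothetical curated (question, answer) pairs; real training data would be
# domain-specific, e.g. revenue questions grounded in analytics tables.
train_pairs = [
    {"question": "How much revenue was generated in Q3 of 2023?", "answer": "$1,234,567"},
    # ... more high-quality examples
]

model_name = "google/flan-t5-small"  # placeholder small model, far smaller than frontier LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(example):
    # Tokenize the question as the input and the answer as the target labels.
    inputs = tokenizer(example["question"], truncation=True, max_length=128)
    labels = tokenizer(text_target=example["answer"], truncation=True, max_length=32)
    inputs["labels"] = labels["input_ids"]
    return inputs

dataset = Dataset.from_list(train_pairs).map(preprocess, remove_columns=["question", "answer"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="small-finance-model",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=3e-4,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```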
Theme #2: Latency as a Technical Constraint
Key takeaway: Managing latency in AI applications is paramount to user experience and requires solutions such as task parallelization and streaming outputs to enhance responsiveness and maintain user engagement
- Latency emerged as another primary challenge for many teams. APIs like Cohere/GPT often took seconds - or even minutes - to respond when embedding a large dataframe (e.g., Arxiv papers). Additionally, post-processing tasks such as creating embeddings and filtering results could take more than 30 seconds or even a few minutes.
- Question-answering tasks involve N+1 API calls, with the last call requiring a large context. This can result in significant delays, especially when multiple sources are considered. Trade-offs exist between answer completeness and response time.
- Considering users are accustomed to instant responses from Google, even minimal latency can impact the user experience. This may also explain why ChatGPT streams its output (as if someone were typing) to make responses feel faster.
- Some teams improved speed by parallelizing tasks such as calling the GPT API or downloading PDFs (see the sketch after this list).
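As an illustration of that parallelization pattern, here is a minimal sketch using a thread pool to issue many API-style calls concurrently. The `call_llm` function is a simulated stand-in for a real GPT or Cohere request, so the timings are illustrative only.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def call_llm(prompt: str) -> str:
    """Stand-in for a real API call (e.g., GPT or Cohere); here it just sleeps."""
    time.sleep(2)  # pretend each request takes ~2 seconds
    return f"answer for: {prompt}"

prompts = [f"Summarize paper #{i}" for i in range(20)]

# Issue all requests concurrently instead of one at a time.
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = {pool.submit(call_llm, p): p for p in prompts}
    results = {futures[f]: f.result() for f in as_completed(futures)}

print(f"{len(results)} responses in {time.time() - start:.1f}s (vs ~40s sequentially)")
```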
Theme #3: Memory Requirements
Key takeaway: Implementing advanced LLMs (e.g., LLaMa) can pose significant challenges due to the high computational resources required, necessitating careful planning and resource allocation
- Several teams intended to use embeddings from LLaMA, Facebook's LLM. However, the LLaMA weights were not available within the timeframe of the hackathon.
- Using LLaMA for hackathon projects also presented challenges due to its high memory requirements, demanding substantial RAM and 30GB of GPU memory, which was infeasible to satisfy in just a few days (one common mitigation is sketched after this list).
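One common mitigation, not something the teams were able to use during the event, is to load a LLaMA-class model with 8-bit quantization, which roughly halves GPU memory relative to fp16. The sketch below assumes the weights are accessible via the Hugging Face Hub and that the `bitsandbytes` and `accelerate` packages are installed; the checkpoint name is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "huggyllama/llama-7b"  # placeholder checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # ~halve GPU memory vs fp16
    device_map="auto",  # let accelerate place layers across available GPUs/CPU
)

inputs = tokenizer("The main bottleneck when serving large models is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0], skip_special_tokens=True))
```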
Theme #4: User Experience and Interface Design
Key takeaway: Enhancing user steerability in chatbot experiences is crucial, but not always easy to achieve. Dynamic query enrichment and other innovative techniques can make interactions personalized and contextually aware.
- Enhancing user steerability in chatbot experiences is essential because it directly influences user engagement, satisfaction, and the overall effectiveness of the AI system. With improved steerability, users can have more natural, contextually aware interactions that feel personalized to their needs.
- From a user's perspective, a chatbot that understands their intent and adapts its responses accordingly provides a far more satisfying and efficient interaction than one that follows a rigid, predetermined script. It can lead to better problem resolution, faster access to information, and a more engaging, human-like conversational experience.
- One team, “Distributed Wikipedia Expert” led by Tommaso Furlanello, developed two tools to enhance user steerability.
- The first tool enabled dynamic query enrichment: the user interface lets users enrich queries to the language model with user-filtered results from vector search, built on Cohere's open-sourced Wikipedia embeddings and on the source code of installed pip packages parsed with LibCST (a minimal sketch of this pattern follows this list).
- The second was a pipeline the team called Automatic Perspective Prompting. This multi-step pipeline searches Wikipedia for topic-intersecting information and distills it into a system prompt that steers the chatbot's personality toward that perspective. The team produced 400 such perspectives, which have been open sourced.
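The sketch below illustrates the general query-enrichment pattern rather than the team's actual code: rank passages by cosine similarity, let the user filter the hits, and prepend the survivors to the prompt. The `embed` function is a placeholder for a real embedding API such as Cohere's.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for a real embedding API (e.g., Cohere's Wikipedia embeddings)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))

passages = [
    "Wikipedia: The Apollo program landed humans on the Moon.",
    "Wikipedia: Cohere provides large language model APIs.",
    "LibCST parses Python source code into a concrete syntax tree.",
]
passage_vecs = embed(passages)

def enrich_query(query: str, keep: list[int], top_k: int = 3) -> str:
    """Rank passages by cosine similarity, keep only user-approved hits, build the prompt."""
    q = embed([query])[0]
    sims = passage_vecs @ q / (np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(q))
    ranked = np.argsort(-sims)[:top_k]
    chosen = [passages[i] for i in ranked if i in keep]  # `keep` simulates the UI filter
    return "Context:\n" + "\n".join(chosen) + f"\n\nQuestion: {query}"

print(enrich_query("Who landed on the Moon?", keep=[0, 2]))
```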
Theme #5: Complexities and Innovations in Pipeline Creation
Key takeaway: Creating complex pipelines with multiple models requires innovative approaches
- Building a complex pipeline with hundreds of model calls, each with unpredictable API call completion times, required new forms of multi-threaded asynchronous scientific computing.
- This approach allows for the execution of multiple tasks in an overlapping time period, which is crucial when dealing with hundreds of models with varied completion times.
- Moreover, this computing had to cope with non-deterministic wall-clock times, which suggests the need for flexibility in pipeline design. Real-world situations present unpredictable scenarios, so pipelines need to adapt to varying time requirements and conditions rather than adhering to a strictly deterministic schedule (see the sketch after this list).
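One way to handle hundreds of calls with unpredictable completion times, sketched here as a general pattern rather than any team's actual pipeline, is asyncio with a concurrency cap and a per-call timeout so slow calls cannot stall the whole run. `call_model` simulates a real async API client.

```python
import asyncio
import random

async def call_model(task_id: int) -> str:
    """Simulated model/API call with non-deterministic latency."""
    await asyncio.sleep(random.uniform(0.1, 5.0))
    return f"result {task_id}"

async def run_pipeline(n_tasks: int = 200, max_concurrent: int = 20) -> None:
    sem = asyncio.Semaphore(max_concurrent)  # cap the number of in-flight calls

    async def bounded(task_id: int):
        async with sem:
            try:
                return await asyncio.wait_for(call_model(task_id), timeout=3.0)
            except asyncio.TimeoutError:
                return None  # a slow call is dropped instead of stalling the whole run

    results = await asyncio.gather(*(bounded(i) for i in range(n_tasks)))
    completed = [r for r in results if r is not None]
    print(f"{len(completed)}/{n_tasks} calls finished within the timeout")

asyncio.run(run_pipeline())
```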
Theme #6: Challenges with LLM Usage and Control
Key takeaway: The effective use and control of LLMs in context-specific scenarios (e.g., IF logic) present significant challenges that can be mitigated through innovative prompt design and frameworks
- One of the biggest challenges for some teams was controlling LLM behavior through in-context learning in the prompt.
- For example, it was very difficult to include “IF logic” in the prompt so the model would respond differently across thought-process flows, such as deciding whether or not to use a tool based on the model’s assessment of its own knowledge of a given question. However, one team managed to implement IF logic in their prompts using a ReAct-style framework (see the sketch after this list).
- Negation - a common issue in language models - posed another obstacle. LLMs struggle to handle negation and to decline to answer (e.g., “I cannot answer that”, “I do not know”). One team managed to design a prompt that forces the LLM to argue (see here for a relevant article on the “negation problem” with LLMs).
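To illustrate what conditional prompting of this kind can look like, here is a minimal, hypothetical ReAct-style sketch, not the team's actual implementation: the prompt tells the model to call a tool only when it is unsure, and a small driver loop branches on that output. Both `llm` and `search` are stand-ins.

```python
# Conditional ("IF") behavior encoded directly in a ReAct-style prompt.
REACT_PROMPT = """You answer questions step by step.
If you are confident you know the answer, reply with:
Final Answer: <answer>
If you are NOT confident, reply with:
Action: Search[<query>]

Question: {question}
Thought:"""

def llm(prompt: str) -> str:
    """Placeholder model: asks to search first, answers once it sees an observation."""
    if "Observation:" in prompt:
        return "Final Answer: the LLMOps Stack on Kubernetes team"
    return "I am not sure about this.\nAction: Search[Tribe AI LLM hackathon winner]"

def search(query: str) -> str:
    """Placeholder tool call."""
    return f"(search results for '{query}')"

def answer(question: str) -> str:
    output = llm(REACT_PROMPT.format(question=question))
    if "Action: Search[" in output:  # the IF branch: the model chose to use the tool
        query = output.split("Action: Search[", 1)[1].split("]", 1)[0]
        observation = search(query)
        # Feed the observation back so the model can produce its final answer.
        output = llm(f"{output}\nObservation: {observation}\nFinal Answer:")
    return output

print(answer("Who won the Tribe AI LLM Hackathon?"))
```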
Who were the winners?
A total of ten groups presented their results after a rigorous week of hacking. Each presentation was evaluated and scored by three expert judges - Paige Fox, Adrian Bauer, and Oleksandr Paraska - who selected the three winners.
First Place: “LLMOps Stack on Kubernetes”
The first place winners deployed an end-to-end pipeline for data generation, fine-tuning, and deployment of an LLM using a personal finance use case. The team trained their model with RLHF on “r/personalfinance” data and integrated a UI to capture user feedback so it could be looped back to improve the model in the future.
Team: Rahul Parundekar, Daniel Vainsencher, PhD, Shri Javadekar, Yulin J. Kuo, Yada Pruksachatkun, and Bryan Davis.
Second place: “AI-Research Assistant”
The second place team built a chat-based research assistant. The tool uses LLMs to help users quickly explore relevant topics and retrieve answers to their questions from academic papers on Arxiv.
Team: Kiyo (Kiyohito) Kunii, Yifan Yuan, and Hector Lopez Hernandez, PhD
Third place: “Soleda AI”
Third place went to Soleda, an AI-powered analytics agent trained on proprietary marketing conversations to perform NLU on users’ requests and generate business insights. The product showed tremendous value for growth marketing teams, who must respond quickly to dynamic market conditions and adapt their strategies to evolving analytics needs, by equipping stakeholders with the tools to make data-driven decisions.
Team: Derek Chen
Acknowledgments
We would like to express our sincere gratitude to all the participants who contributed to the success of this LLM hackathon. Your efforts, innovative ideas, and enthusiasm to learn and collaborate played a pivotal role in making this event possible. Your contributions have not only elevated the hackathon itself, but have also expanded our collective knowledge and advanced innovation in this field.
We would also like to extend our thanks to our sponsors, Cohere and Modal Labs, for their generous support in providing credits and technical assistance to the participants throughout the event. Additionally, we are grateful to Richard Abrich from OpenAdapt.ai and his company for sponsoring some of the cash prizes.
Lastly, we extend our gratitude to our three judges - Paige Fox, Adrian Bauer, and Oleksandr Paraska - for taking the time to provide feedback on the projects and for enriching the Tribe AI community. Thank you!