Caue Paiva caue-paiva

Hello! My name is Cauê. I am a computer science student at USP, a Software Engineer at Uber, the founder of the TaCertoIssoAI AI-powered fact-checking project, and an undergraduate researcher in AI and Data Science.

🔭 Currently, my main interest is developing GenAI projects, such as using Agentic AI systems and Data Analytics to fight misinformation on WhatsApp.
⚙️ Besides AI and Machine Learning, I am also highly interested in Software Engineering. I am proficient in Golang and Python (FastAPI) for back-end microservices, and in TypeScript for front-ends and Node.js servers.
🌱 I am learning how to use various technologies, especially in the world of Agentic AI for applications and software development. A few examples are Claude Code, LangChain, LangGraph, Google Cloud, and AWS, among others.
📫 You can reach me at the email [email protected]
🧑‍💻 My LinkedIn: https://www.linkedin.com/in/cauepaiva/

My projects

AI-Powered Platform for combating misinformation on WhatsApp 📱 (TaCertoIssoAI)

An end-to-end AI platform that fights misinformation where it spreads: inside WhatsApp. It uses LLMs with retrieval-augmented generation to fact-check messages with evidence-grounded, source-cited outputs, plus an analytics layer to study misinformation trends at scale. The platform also classifies content into structured topic hierarchies using semantic embeddings for better exploration and monitoring. You can access the Data Analytics platform at https://tacertoissoai.com.br.

The project earned national recognition as a Top 3 winner out of 173 projects in the AI4Good program, with a funded invitation to present the project at MIT and Harvard. The project currently has around 10k distinct users and over 15k verified messages on WhatsApp.

🏎️ AWS DeepRacer Student Competition 2024 🏎️

In 2024, AWS organized the AWS DeepRacer competition in Brazil, bringing together university students from across the country. The competition involved training autonomous race cars to compete on both virtual and physical tracks, using deep learning and reinforcement learning algorithms to achieve the fastest lap times.

I was the national champion of the competition and received one of the race cars as my trophy. I later donated it to a robotics lab at my university, USP-ICMC, so that other students could use it for research, learning, and future competitions.

The code for the reward functions I used to train my models, as well as the data analysis code I used to study my performance on the tracks, can be found in this repo. There is no polished public README yet because the repo was private for a long time, as it came from a competition I took part in.
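For readers unfamiliar with DeepRacer: a reward function is a plain Python function that receives a dictionary describing the car and the track and returns a float. The sketch below is a generic "stay near the center line" example, not one of the functions from the repo. The input keys are standard DeepRacer parameters, while the thresholds and reward values are illustrative assumptions.

```python
def reward_function(params):
    """Reward staying on the track and close to the center line."""
    # Standard DeepRacer input parameters.
    all_wheels_on_track = params["all_wheels_on_track"]
    distance_from_center = params["distance_from_center"]
    track_width = params["track_width"]

    # Leaving the track earns a near-zero reward.
    if not all_wheels_on_track:
        return 1e-3

    # Assumed thresholds, expressed as fractions of the track width.
    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3

    return float(reward)
```

Competitive reward functions usually combine several signals (speed, steering angle, progress), so this should be read only as a starting point.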

Projects I develop as part of a São Paulo State Research Foundation (FAPESP) R&D grant program

Data Warehouse and automatic ETL pipeline for extracting and analyzing public Brazilian government data with interactive dashboards

This project aims to develop a Data Warehouse (DW) that consolidates multiple public government data points over several years, focusing on socio-economic indicators. The DW will support analytical queries and time-series analysis, providing decision-makers with deeper insights into areas such as Economic Activity, Environmental Policies and Damage, and Public Health. Additionally, the project features an ETL pipeline to automate the collection, transformation, and loading of data from public sources into the DW. The end goal is to use the DW to serve interactive dashboards that make this data easier to analyze.
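As a rough illustration of the load and query steps, here is a minimal sketch using SQLite from the Python standard library. The table names, columns, and the `load_indicator_rows` and `yearly_series` helpers are hypothetical and chosen only for this example; the actual project schema and database engine may differ.

```python
import sqlite3


def load_indicator_rows(conn, rows):
    """Load (state, year, indicator, value) rows into a tiny star schema."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS dim_state (
            state_id INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS fact_indicator (
            state_id INTEGER NOT NULL REFERENCES dim_state(state_id),
            year INTEGER NOT NULL,
            indicator TEXT NOT NULL,
            value REAL NOT NULL,
            PRIMARY KEY (state_id, year, indicator)
        );
        """
    )
    for state, year, indicator, value in rows:
        # Transform: normalize the state name so reloads hit the same row.
        state = state.strip().upper()
        conn.execute("INSERT OR IGNORE INTO dim_state (name) VALUES (?)", (state,))
        (state_id,) = conn.execute(
            "SELECT state_id FROM dim_state WHERE name = ?", (state,)
        ).fetchone()
        # Re-running the pipeline overwrites the value instead of duplicating it.
        conn.execute(
            "INSERT OR REPLACE INTO fact_indicator VALUES (?, ?, ?, ?)",
            (state_id, int(year), indicator, float(value)),
        )
    conn.commit()


def yearly_series(conn, state, indicator):
    """Time-series query: one indicator for one state, ordered by year."""
    cursor = conn.execute(
        """
        SELECT f.year, f.value
        FROM fact_indicator AS f
        JOIN dim_state AS s ON s.state_id = f.state_id
        WHERE s.name = ? AND f.indicator = ?
        ORDER BY f.year
        """,
        (state, indicator),
    )
    return cursor.fetchall()
```

Keeping the load idempotent (via the composite primary key) is one way to let an automated pipeline re-run safely over the same source data.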

Architecture of the Project

Modules of the Project

Automatic ETL pipeline for extracting, cleaning and processing public Brazilian government data using APIs and web scraping

Python and SQL scripts related to the Data Warehouse, its schema and the insertion and retrieval of Data

Projects I develop as part of a Brazilian government R&D grant program (PIBIT CNPq)

Educational Chatbot for Brazilian high school students

The project builds upon the educational capabilities of Large Language Models (e.g., GPT-3.5 and GPT-4), while also mitigating weaknesses such as hallucination and lack of knowledge about certain subjects and tests within the Brazilian standardized university admission exam (ENEM).

To achieve these results, an LLM application was developed using OpenAI models (GPT-3.5 Turbo or GPT-4), along with additional modules such as internet search and retrieval-augmented generation (RAG) for extra functionality.

According to feedback, over 60% of users said our solution gives better and more accurate answers than ChatGPT.

UI of the AI-powered education platform, allowing users to select custom subjects for chats, such as Math and History. This topic selection also applied to the RAG system, enabling retrieval over ENEM questions for the chosen subject.

CustomGPTs using APIs hosted on AWS

Implementation of the Educational Chatbot described above but using the new OpenAI customGPTs service.

Helpful prompts and data extracted from official sources about the ENEM test were used for better results.

For RAG over ENEM test questions, a GPT action and its associated API were used. The API is hosted on AWS API Gateway and uses a Lambda function that takes the user input, embeds it with OpenAI embeddings, and then queries the Qdrant vector DB for the N questions most similar to that input, where N is the number of questions the user requested.
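The request flow described above (parse input, embed, retrieve the top N) can be sketched as follows. This is not the deployed Lambda code: the `embed` callable and the in-memory `indexed_questions` list are stand-ins for the OpenAI embeddings call and the Qdrant query, and the request body fields `query` and `n` are assumed names. A real Lambda handler would also take the `(event, context)` signature.

```python
import json
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_questions(query_vector, indexed_questions, n):
    """Return the texts of the n stored questions most similar to the query."""
    ranked = sorted(
        indexed_questions,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:n]]


def handle_request(event, embed, indexed_questions):
    """API Gateway-style request: JSON body in, JSON body out.

    `embed` and `indexed_questions` are injected stand-ins (assumptions)
    for the embedding model and the vector database.
    """
    body = json.loads(event["body"])
    query_vector = embed(body["query"])
    n = int(body.get("n", 3))
    questions = top_n_questions(query_vector, indexed_questions, n)
    return {"statusCode": 200, "body": json.dumps({"questions": questions})}
```

Injecting the embedding function and the index makes the retrieval logic testable locally without network access or cloud credentials.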

ETL pipeline for processing PDFs and feeding data into vectorDBs

For the educational chatbots, both the website and the customGPT version, I needed a large dataset of ENEM questions and their correct answers to power RAG and reduce LLM hallucinations (such as giving the wrong answer to a question), but no such large-scale data was available online.

In that context I created this project (which is my repo with the most GitHub stars!). It combines PDF data mining through libraries like PyMuPDF2 to transform the ENEM PDFs into either text or JSON files (the Extract and Transform steps) with a Qdrant vector DB loader that loads the data into the vector store (the Load step). The pipeline can process either single test PDFs (and their associated answer PDFs) or entire folders with multiple tests, loading hundreds of questions at once, while writing metadata and stats about the extraction process (number of extracted questions per year and subject) to a CSV file through a Pandas DataFrame.
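The stats step can be illustrated with a short sketch. It assumes each extracted question is a dictionary with `year` and `subject` keys (hypothetical field names), and it uses the standard-library `csv` module and `Counter` instead of a Pandas DataFrame so the example stays dependency-free.

```python
import csv
import io
from collections import Counter


def extraction_stats(questions):
    """Count extracted questions per (year, subject)."""
    counts = Counter((q["year"], q["subject"]) for q in questions)
    return [
        {"year": year, "subject": subject, "extracted_questions": count}
        for (year, subject), count in sorted(counts.items())
    ]


def write_stats_csv(rows, out_file):
    """Write the per-year, per-subject counts as CSV to an open text file."""
    writer = csv.DictWriter(
        out_file, fieldnames=["year", "subject", "extracted_questions"]
    )
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample records, only to show the output shape.
    sample = [
        {"year": 2022, "subject": "math", "text": "..."},
        {"year": 2022, "subject": "math", "text": "..."},
        {"year": 2021, "subject": "history", "text": "..."},
    ]
    buffer = io.StringIO()
    write_stats_csv(extraction_stats(sample), buffer)
    print(buffer.getvalue())
```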

Projects I developed to learn new technologies and concepts!

Projects about: Python for Data Science and Engineering

Crypto Data ETL pipeline for Analytical Dashboards with Airflow and AWS

This project aims to collect and update data on cryptocurrencies like Bitcoin and Ethereum, storing the information in CSV files. These files cover extensive periods of trading data collected from the Binance US API.

The main technologies used are AWS Cloud (Lambda, API Gateway, EC2 and S3), Apache Airflow for data pipeline orchestration, Python and Pandas for manipulating the data, and Plotly for plotting it.

Here's the architecture of the project/pipeline:

Here is one example plot generated by the project, taken from one of my LinkedIn posts
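To give an idea of the transform step, the sketch below turns raw candlestick (kline) records into CSV-ready rows. It assumes the Binance kline layout, where each record is a list starting with open time in milliseconds followed by open, high, low, close, and volume as strings. The `klines_to_rows` helper and the output column names are hypothetical, not taken from the project.

```python
from datetime import datetime, timezone


def klines_to_rows(klines):
    """Convert raw kline lists into dictionaries with typed values."""
    rows = []
    for kline in klines:
        # Only the first six fields are used; the rest are ignored here.
        open_time_ms, open_, high, low, close, volume = kline[:6]
        rows.append(
            {
                "open_time": datetime.fromtimestamp(
                    open_time_ms / 1000, tz=timezone.utc
                ).isoformat(),
                "open": float(open_),
                "high": float(high),
                "low": float(low),
                "close": float(close),
                "volume": float(volume),
            }
        )
    return rows
```

Rows in this shape can be handed to `csv.DictWriter` or `pandas.DataFrame` before being written out.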

Plotting tool for analysis of public violence data in Brazil

This project implements a Python script for data analysis and visualization with plots based on data from IPEA (Institute for Applied Economic Research) and its "Map of Violence" database. It allows analyzing data such as Homicide and Suicide Rates by state and year. The dataset also includes gender-separated data, enabling a historical series analysis of violence against women.

The program automatically generates plots, one for each specified year, based on the retrieved data. The plots will be saved in the current directory.

Projects about: C, C++ and low-level programming

Robot with Computer Vision and Speech Recognition

Developed as a group project for an electronics class at university and presented at the Undergraduate Research Symposium of the University of São Paulo (SIICUSP 2023).

The goal of this effort was to integrate Machine Learning models, such as computer vision and text classification, with a robot powered by a microcontroller (ESP-32).

My main contribution was software development for the ESP-32 embedded system, using C++ and modules such as Wi-Fi HTTP request handlers.

Here's the certificate for the Symposium

Interactive terminal game using Threads in C++

This project was developed as part of an undergrad course in Operating Systems. The main goal was to create an interactive game displayed on the terminal, using threads and mutexes to allow for concurrent operations such as rendering the game board, getting user input, and moving the game elements, among others.

One of the main benefits from this project was my further familiarization with C++ stdlib functions, classes and structures for working in a multi-thread environment.
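The core idea, several threads updating shared game state behind a mutex, can be sketched briefly. The example is in Python rather than C++ to stay consistent with the other snippets on this page; `threading.Lock` plays the role that `std::mutex` with `std::lock_guard` plays in C++. The `GameState` class and `run_workers` helper are hypothetical and not taken from the game's code.

```python
import threading


class GameState:
    """Shared state guarded by a mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._player_x = 0

    def move(self, dx):
        # Only one thread at a time may modify the position.
        with self._lock:
            self._player_x += dx

    def snapshot(self):
        # A renderer thread would read a consistent value this way.
        with self._lock:
            return self._player_x


def run_workers(state, workers=4, moves_per_worker=1000):
    """Start several threads that each move the player, then wait for them."""

    def worker():
        for _ in range(moves_per_worker):
            state.move(1)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

Without the lock, the read-modify-write in `move` could interleave between threads and lose updates, which is the kind of bug the mutex prevents.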

Technologies I am familiar with:
