Asankhaya Sharma PRO

codelion

AI & ML interests

AI/ML, Dev Tools and Application Security

Organizations

Posts 6

view post
Post
1538
WorkerSafetyQAEval: A new benchmark to evaluate worker safety domain question and answering

Happy to share a new benchmark on question and answers for worker safety domain. The benchmark and leaderboard is available at
codelion/worker-safety-qa-eval

We evaluate popular generic chatbots like ChatGPT and HuggingChat on WorkerSafetyQAEval and compare it with a domain specific RAG bot called Securade.ai Safety Copilot - codelion/safety-copilot It highlights the importance of having domain specific knowledge for critical domains like worker safety that require high accuracy. Securade.ai Safety Copilot achieves ~97% on the benchmark setting a new SOTA.

You can read more about the Safety Copilot on https://securade.ai/blog/how-securade-ai-safety-copilot-transforms-worker-safety.html
view post
Post
1072
After the announcements yesterday, I got a chance to try the new gemini-1.5-flash model from @goog1e , it is almost as good as gpt-4o on the StaticAnalaysisEval ( patched-codes/static-analysis-eval) It is also a bit faster than gpt-4o and much cheaper.

I did run into a recitation flag with an example in the dataset where the api refused to fix the vulnerability and flagged the input as using copyrighted content. This is something you cannot unset even with the safety filters and seems to be an existing bug https://issuetracker.google.com/issues/331677495

But overall you get gpt-4o level performance for 7% the price, we are thinking of making it default in patchwork - https://github.com/patched-codes/patchwork You can use the google_api_key and model options to choose gemini-1.5-flash-latest to run it with patchwork.