Benchmarks are based on https://artificialanalysis.ai/models/gpt-4-turbo

TTFT (time to first token) for each model depends on model size, which directly impacts the model's latency and average response time. We plan to build fine-tuned models that address our requirements, based on the flows and flexibility of the multi-agent system.
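As a minimal sketch of how we could measure TTFT ourselves rather than rely only on published benchmarks, the snippet below times the gap between sending a streaming request and receiving the first token. It assumes the official openai Python client with an API key in the environment; the model name and prompt are illustrative placeholders.

```python
import time
from openai import OpenAI  # assumes the official openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft(model: str, prompt: str) -> float:
    """Return seconds from request start until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # The first chunk carrying content marks the time to first token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start  # stream ended without content

print(f"TTFT: {measure_ttft('gpt-4-turbo', 'Hello!'):.3f}s")
```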

Also, depending on the task, we will employ a cascading approach: a mixture of model sizes that keeps output quality high while supporting the overall performance of the platform.
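A minimal sketch of that cascade is shown below: try the cheapest/fastest model first and escalate to a larger one only when the output fails a quality gate. The `generate` call, the quality check, and the model names are all hypothetical placeholders standing in for real inference and validation logic.

```python
from typing import Iterable

def generate(model: str, prompt: str) -> str:
    """Placeholder for an actual inference call (hosted API or local model)."""
    return f"[{model}] answer to: {prompt}"

def passes_quality_check(output: str) -> bool:
    """Placeholder quality gate; in practice this could be a schema/regex
    check, a task-specific rubric, or a small judge model."""
    return len(output) > 20  # trivial stand-in heuristic

def cascade(prompt: str, models: Iterable[str] = ("small", "medium", "large")) -> str:
    """Run models from cheapest/fastest to largest, escalating only when
    the cheaper model's output fails the quality gate."""
    output = ""
    for model in models:
        output = generate(model, prompt)
        if passes_quality_check(output):
            return output
    return output  # fall back to the largest model's attempt

print(cascade("Summarise the quarterly report in one sentence."))
```

The design trade-off is that most requests are served by the small, low-TTFT model, while the larger models are paid for only on the fraction of requests that need them.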

https://lh7-us.googleusercontent.com/UCKip3w7xzMdP6gkCCxCDRREVAu0yImOGlNm-bUq066AK4lCKV4s2y_Ct0J5eMTL3j-h8w_nJ2XYYs24HiE0Mh4sE7Ak4Zg-Gldhd-G6cFcWKcIDReXPpogPKtiWsnL_6DOcqNEDHsZX9MsjNWgjE78

https://lh7-us.googleusercontent.com/7OQoTnCd4c7RJcRUFuw_P9bDYX6opk61nFKeTUHKDRWxALaUgMBx3y0XDSxNkC_bJZD5xc885GhA-5unko1mLNM3WLA8XEd_-YeasFz1LiKhaEWU9HAKGtGvEoPZojGB8TlJnpPSGMHIajGW2ReHqDo

https://lh7-us.googleusercontent.com/vCJLvogqaPYb_0MtLekKpN03v0YlMq3U12UhSow_oKBwC0oSN62AdKqGh9Q4e47xU3CtwvglZAmFCjPdsLJnlW6PK-F_WmJGAX-OFSd3cj3THDjKD1F5InxybaIDFVIUbjJFS3aKZ_F2ssuy-TnvIYo

On standard hardware, the benchmarks show noticeably slower output speed, while inference latency remains roughly the same.

https://lh7-us.googleusercontent.com/f0-A6ciS_Yb8schzDEZsnXz-nirRLStx6jwTpppD6N9HxQ_lb-knnJDLvzsLI-dHTMkgLRAvQg4u39PbwXu2jJJrOJ60ZCiQ2UQjCosTYihFrSr-AJKAFZmlL6OXbp7lxGn_sTA3rVsI7tmD-4x3XA0

https://lh7-us.googleusercontent.com/uBrSmQP55u6D3_ckfirwfsRxNes_g5kKv00AiF4WdexNTplNwZgheq2SSbQL2yWGY__6nHT-pgL_2zIw5n0brhIu9SkAVvTzSL6ilmvmTE4P6zKwJTfFW4Tf_-N32gT9c-Lep83sFeKen8Hzhm7rg9w

https://lh7-us.googleusercontent.com/ep8w4bo78UpuVhtVvJStgxpXqJ3mpCBFEP0fh358__XGRUkTeikzInFXiSvQby92XJRgGSxLdm1D0jDZOdLZI2wupftdJ4yTcL-QPWncB7ChPZM15dhZEXR_PuwuKXsYZM6HUg3iJreaOm_19q09y38

https://lh7-us.googleusercontent.com/bQNCwms_ZKlEoElJE8WO5dMLTHb68czoMVExr9kGPILjwFy1S3dzrdgApdNMlPFXPqtw0k434RP1HsLF17ZYnqHsVrSHabaA1f2ebQMoumIsZ9S7B_O3Y2gbdcH4N_X2RJ0hOdxWGfQaeXIpkGOKvSE