Jina Code Embeddings 0.5b

Jina AI

Efficient code embeddings from code generation models

jina-code-embeddings-0.5b is a 494-million-parameter code embedding model for retrieving code from natural-language queries, answering technical questions, and identifying similar code across languages. Built on the Qwen2.5-Coder-0.5B backbone, it generates embeddings via last-token pooling. It addresses a fundamental limitation of traditional code embedding models, which rely on scarce aligned data such as comments and docstrings: instead, it leverages the abundant unaligned code and documentation used in LLM training, achieving state-of-the-art performance despite its compact size. The model supports five task categories selected via instruction prefixes: NL2Code, TechQA, Code2Code, Code2NL, and Code2Completion. It also implements Matryoshka representation learning, producing truncatable embeddings that allow flexible precision-resource trade-offs.

Highlights:

- Multilingual support (15+ programming languages) across a wide range of domains, including web development, software development, machine learning, data science, and educational coding problems.
- Task-specific instruction prefixes for NL2Code, Code2Code, Code2NL, Code2Completion, and TechQA, selectable at inference time.
- Flexible embedding size: dense embeddings are 896-dimensional by default but can be truncated to as few as 64 dimensions with minimal performance loss.
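The two mechanisms above — last-token pooling and Matryoshka truncation — can be sketched independently of the model itself. The following is a minimal, self-contained illustration using random tensors in place of real transformer hidden states; the shapes (hidden size 896, truncation to 64) follow the figures in this card, but the helper names are our own, not part of any Jina API.

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state of the last non-padding token in each sequence.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    """
    last_idx = attention_mask.sum(axis=1) - 1  # index of last real token per row
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka truncation: keep the first `dim` components, then re-normalize
    so that cosine similarity remains meaningful at the reduced size."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Toy batch: 2 sequences, max length 4, hidden size 896 (the model's default).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 896))
mask = np.array([[1, 1, 1, 0],   # sequence 1 has 3 real tokens
                 [1, 1, 1, 1]])  # sequence 2 has 4

emb = last_token_pool(hidden, mask)       # shape (2, 896)
emb64 = truncate_and_normalize(emb, 64)   # shape (2, 64), unit-norm rows
```

In a real pipeline, `hidden_states` would come from the model's final layer, and the query text would carry the task-specific instruction prefix before tokenization; the truncation step is what lets you trade index size and search speed against retrieval quality.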