
AITech Interview with Narek Tatevosyan, Product Director at Nebius AI

Gain insights from Narek Tatevosyan, Product Director at Nebius AI, on overcoming AI challenges, enabling scalable compute, and advancing Gen AI with cutting-edge solutions.

Hello Narek, could you tell us about your professional journey and how it led you to your role as Product Director at Nebius AI?

I started my career as a network engineer, working on large-scale projects like building broadcast networks for the Olympics in London and Sochi. While I enjoyed the technical challenges, I realised I wanted to solve broader real-world problems using technology, which led me to consulting and an MBA. In one key project, I developed a cloud platform business for a service provider in Kazakhstan, from business planning to deployment.

My interest in machine learning grew during this time, and I eventually joined Yandex Cloud, where I helped build a hyperscale platform from scratch. I quickly saw that it lacked product-market fit, so I worked with development teams to refocus on cloud-native solutions and DevOps, which led to the platform’s success.

After the war started, I moved to Nebius with a team of experts. Here, I combined my passion for AI with my cloud experience, becoming a Product Director. I now lead teams to create an AI Cloud platform, which lets me merge my love for technology with impactful business solutions.

Tell us in detail how AI technologies influence business operations.

I believe that most companies today, whether they operate online or in a traditional office, are already using AI in some form, and eventually every company will be an AI-driven business.

AI boosts productivity by helping people do more in less time. For example, generative AI can take meeting notes and create Jira tickets, while developer copilots speed up coding. It also enhances customer experience by enabling support copilots to resolve issues faster and improve communication.

On the security side, companies that resist adopting AI often face “shadow AI” usage, which introduces extra risks. It’s better to embrace AI with best practices than to ignore it.

What are the major challenges that Gen AI companies face, and how does Nebius AI work to make these challenges approachable?

One of the biggest challenges Gen AI companies face is the huge cost of compute resources, with some spending up to half of their budget on it. This means they need to maximise their investment by ensuring their infrastructure is set up quickly, runs efficiently, and is highly reliable.

At Nebius, we address these challenges by investing heavily in all three areas – ease of use, performance, and reliability – across our IaaS and platform services, helping companies get the most out of their resources.

How do you think ML engineers struggle with infrastructure management, and what do companies do to bridge this gap?

ML engineers aren’t trained to manage large infrastructure – they’re focused on innovating, training models, and optimising them for inference. Their role is to use the infrastructure, not to set it up or manage it.

For larger companies, this means having a dedicated DevOps or MLOps team to handle the infrastructure so the ML engineers can focus on their core tasks. Alternatively, they can use managed platform tools and services that take infrastructure setup and operations off the ML engineers’ plate.

What are the skills that MLOps professionals need to effectively bridge the gap between ML engineering and infrastructure management?

MLOps professionals need a mix of skills to bridge the gap between ML engineering and infrastructure management. They must be proficient in setting up and managing large clusters using schedulers like Kubernetes or Slurm, handling shared storage, and optimising high-performance networks like RDMA. They also need a deep understanding of ML workflows, from model training to inference.
But managing the entire stack is complex and often too much for one person or team. That’s where cloud providers like Nebius step in, offering tools like Terraform modules, APIs, and managed services for Kubernetes and Slurm. These services help MLOps teams be more productive by reducing the heavy lifting of infrastructure management.
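To make this concrete, here is a minimal sketch (not Nebius-specific) of the kind of task an MLOps engineer automates: submitting a GPU-backed pod through the official Kubernetes Python client. The image name, namespace, and GPU count are illustrative placeholders.

```python
# Minimal sketch: launch a GPU training pod via the Kubernetes Python client.
# Assumes a kubeconfig is present; image and namespace are placeholders.
from kubernetes import client, config

config.load_kube_config()  # reads your local kubeconfig

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="train-job", namespace="ml-team"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/trainer:latest",  # placeholder image
                command=["python", "train.py"],
                resources=client.V1ResourceRequirements(
                    # GPUs are requested as Kubernetes extended resources
                    limits={"nvidia.com/gpu": "8"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-team", body=pod)
```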

Multi-cloud strategy has become the norm across Gen AI companies; what advantages do you think these companies leverage by accessing multiple cloud providers?

Adopting a multi-cloud strategy offers several advantages for Gen AI companies, including flexibility, enhanced reliability, and cost optimisation. By leveraging multiple cloud providers, companies can distribute their workloads according to the strengths of each platform, ensuring that they get the best performance for their specific needs. This approach also mitigates the risks associated with vendor lock-in and provides redundancy, which is crucial for maintaining uptime.

At Nebius, we recognise the importance of this strategy and design our infrastructure to integrate seamlessly with multiple cloud environments, enabling our clients to take full advantage of the resources available to them. We also support open-source technologies such as Kubernetes and tooling such as Terraform, which fit naturally into multi-cloud setups.

Can you elaborate on the challenges that companies face when scaling up their compute resources in a single cloud environment, and how does a multi-cloud strategy alleviate these challenges?

Scaling compute resources within a single cloud environment can pose significant challenges, such as resource bottlenecks, unexpected cost surges, and limited flexibility in resource allocation. Companies may find themselves constrained by the capabilities of a single provider, which can hinder their ability to innovate. A multi-cloud strategy alleviates these challenges by allowing companies to optimise their resource usage across various platforms, ensuring they can scale efficiently and cost-effectively.

Being multi-cloud also means using multi-cloud tooling, which is usually open source. This tooling is sometimes less user-friendly than proprietary alternatives, but it gives you the flexibility to switch cloud providers if risks emerge with one of them.

What strategies do you use to ensure that Nebius AI stands out in providing scalable compute resources?

We focus on providing scalable GPU compute resources with several key features that set us apart. As a cloud platform, not a bare metal provider, we enable you to quickly create interconnected H100 GPU clusters. We offer both compute services with Ubuntu VMs and a managed Kubernetes service for managing these nodes.

You can start using our platform immediately without complex KYC processes, obtaining H100 GPUs within minutes and H100 GPU clusters within days, not weeks as with traditional vendors.
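As an illustration of what “ready to train” means in practice, here is a hedged sketch of a pre-flight check an ML team might run on a freshly provisioned GPU cluster: a PyTorch NCCL all-reduce across all ranks. It assumes PyTorch with CUDA and a launch via torchrun; it is not a Nebius-specific API.

```python
# Cluster pre-flight check: verify NCCL all-reduce works across all ranks
# before committing to a long training run.
# Launch with: torchrun --nproc-per-node=8 check.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # torchrun sets rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes a tensor of ones; after all-reduce every rank
# should hold the world size in every element.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
assert x.item() == dist.get_world_size()

if dist.get_rank() == 0:
    print(f"all-reduce OK across {dist.get_world_size()} ranks")
dist.destroy_process_group()
```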

We believe that as a cloud provider, we must instill confidence in the security of your data. We provide shared filestore and object storage for data streaming and checkpoints, along with MLflow and OCI registries for storing models. Plus, our team of architects can assist in setting up and running other platform systems.
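For example, checkpoint and experiment tracking against a managed MLflow instance might look like the following sketch; the tracking URI, experiment name, and file names are placeholders for whatever your deployment exposes.

```python
# Sketch: track metrics and push a checkpoint to a managed MLflow instance.
import mlflow

mlflow.set_tracking_uri("https://mlflow.example.com")  # placeholder endpoint
mlflow.set_experiment("h100-finetune")

with mlflow.start_run():
    mlflow.log_param("lr", 3e-4)
    for step, loss in enumerate([2.1, 1.7, 1.4]):  # stand-in for a training loop
        mlflow.log_metric("loss", loss, step=step)
    # log_artifact uploads a local file (e.g. a checkpoint) to the tracking server
    mlflow.log_artifact("checkpoint.pt")
```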

What really differentiates us is our focus on operational efficiency and cost savings. We offer services that streamline infrastructure setup, reduce maintenance costs, and ensure high reliability – hot spares are available in case of hardware failure, and we have proprietary tech to handle problem nodes. This means fewer failed jobs and more focus on ML tasks, not infrastructure management.

We also support easy scaling, allowing you to spin up new VMs on-demand or reserve GPUs for better pricing. Plus, our dedicated architects are experts in MLOps, ensuring you get the support you need for your AI infrastructure.

How do you ensure that the company supports the deployment of machine learning tools, and how do these tools streamline workflows for ML engineers?

We support a diverse set of ML tools because AI research is rapidly evolving. We aim to support all the tools and technologies that customers request, and we explore those tools together with them.

Our Kubernetes app engine bundles popular Kubernetes applications into easy-to-install packages that users can leverage to explore new technologies. Kubeflow, Ray, Slurm, Airflow, and others can be easily used with our tools.
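As a small illustration of the kind of workload such packaged applications enable, here is a minimal Ray sketch that fans tasks out across a cluster; the function and shard count are hypothetical, and on a real cluster you would attach with ray.init(address="auto").

```python
# Minimal Ray sketch: distribute independent tasks across a cluster.
import ray

ray.init()  # local mode here; on a cluster, ray.init(address="auto") attaches to it

# On a GPU cluster you would request one GPU per task with
# @ray.remote(num_gpus=1); the plain decorator keeps this runnable anywhere.
@ray.remote
def process_shard(shard_id: int) -> str:
    return f"shard {shard_id} done"

# Fan eight tasks out across the cluster and gather the results.
results = ray.get([process_shard.remote(i) for i in range(8)])
print(results)
```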

We also partner with popular proprietary tool providers to benefit our users. We’ve supported Weights & Biases’ launch and are working on supporting Run AI in our platform.

What kind of future do you envision for Gen AI companies as they keep evolving with cloud technologies?

We are still in the early stages of Gen AI adoption. While current models are promising, they’re not yet advanced enough for widespread production use. The next major milestone will be achieved through a combination of innovative machine learning advancements and next-generation computing capabilities.

As a cloud provider, we’re committed to delivering cutting-edge compute power to the market as quickly as possible, integrating it into familiar services like Kubernetes to facilitate ML workloads.

With the emergence of new models, you’ll always need to run applications for them. Offerings like AI studios will enable you to leverage the best open-source models for your use cases, allowing you to develop applications without the need for extensive model training or inference infrastructure.
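In practice, consuming an open-source model through such a studio-style offering often looks like calling an OpenAI-compatible endpoint. The sketch below assumes that pattern, with a placeholder base URL and model name rather than any specific vendor’s API.

```python
# Sketch: call an open-source model behind an OpenAI-compatible endpoint.
# base_url and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",  # example open-source model name
    messages=[{"role": "user", "content": "Summarise this meeting transcript."}],
)
print(resp.choices[0].message.content)
```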

We also see the inference hardware market growing, with a myriad of new vendors introducing innovative solutions that can run agents in seconds instead of tens of seconds, or generate video content more efficiently. This will enable more practical applications, such as iterating on video generation every second instead of waiting minutes. Ultimately, it could make producing full-motion pictures at production scale feasible in a reasonable timeframe.

However, relying on the wrong hardware provider poses significant risks for companies. Cloud providers can mitigate these risks by helping customers harness the value of new technologies without committing to potentially outdated hardware.

Narek Tatevosyan,

Product Director at Nebius AI

Narek Tatevosyan, Product Director at Nebius AI, brings over 15 years of experience in the IT industry.

With a distinguished career focused on cloud and infrastructure development, Narek has successfully led multiple high-profile projects over the past decade. At Nebius AI, he spearheads the development and design of high-load services, including advanced Machine Learning (ML) applications. Narek is currently dedicated to building a cutting-edge cloud platform for AI practitioners, aimed at optimising the model training lifecycle and inference.

