Setting Up an NLP Inference Server with Hugging Face: A Comprehensive Guide

Gabriel Kasser
6 min read · Dec 12, 2023


Setting up an NLP inference server often feels like navigating a labyrinth of complexity. The challenge isn’t just in harnessing the power of advanced models; it’s in the intricate dance of managing threads with GPU acceleration and the delicate balancing act of handling concurrent requests. For developers and data scientists, these tasks can quickly turn from exciting to tedious, presenting a barrier that obscures the true potential of NLP applications.

Enter the realm of olympipe, my custom-engineered pipeline engine designed to alleviate these very challenges. With a keen focus on simplifying the intricate backend processes, olympipe transforms the ordeal of setting up an NLP inference server from a daunting task into a streamlined, manageable process. In this article, I’ll share how olympipe not only eases the pain points associated with thread and request management but also empowers you to unlock the full capabilities of your NLP models with unparalleled ease and efficiency. Whether you’re a seasoned expert or just starting out, olympipe is here to change the way you think about and interact with NLP inference servers.

Code Overview

In this section, let’s delve into the key components of the code that powers our NLP inference server, highlighting the functionality and structure of the NLPModel class, the integration with the Pipeline, and the process of handling and responding to HTTP requests.

The NLPModel Class

class NLPModel:
    def __init__(self, model_name: str, use_half: bool = True):
        # Prefer the GPU when one is available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Loading inference model {model_name} on {self.device}")
        if use_half:
            # Load weights in float16 to cut memory use and speed up GPU inference
            torch.set_default_dtype(torch.float16)
        self.model = transformers.AutoModelForCausalLM.from_pretrained(model_name).to(
            self.device
        )
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
        print(f"Model {model_name} loaded.")

    def generate_response(
        self, packet: Tuple[socket, Dict[str, Any]]
    ) -> Tuple[socket, Dict[str, Any]]:
        connection, data = packet

        prompt = data.get("prompt")
        max_new_tokens = data.get("max_new_tokens", 100)
        num_return_sequences = data.get("num_return_sequences", 1)

        # Tokenize the prompt and generate without tracking gradients
        with torch.no_grad():
            outputs = self.model.generate(
                self.tokenizer.encode(
                    prompt, return_tensors="pt", add_special_tokens=False
                ).to(self.device),
                max_new_tokens=max_new_tokens,
                num_return_sequences=num_return_sequences,
            )

        responses = [
            self.tokenizer.decode(output, skip_special_tokens=True)
            for output in outputs
        ]

        return connection, {"responses": responses}

The NLPModel class lies at the heart of our server's functionality. It's a fairly standard implementation that serves to instantiate our NLP model, but with some crucial optimizations for performance and flexibility. Let's break down its main components:

  • Model Initialization: Upon creating an instance of NLPModel, the specified model from Hugging Face is loaded. This process is streamlined and optimized for performance, with an option to use torch.float16 for computations, enhancing speed, especially on GPUs.
  • The generate_response Method: This method is where the magic happens. For each incoming packet, generate_response extracts the prompt and uses the model to generate a response. It handles the intricacies of tokenizing the input prompt, feeding it to the model, and then decoding the generated output. The beauty of this method lies in its simplicity from a user's perspective, while internally managing the complex interactions with the NLP model. A minimal standalone sketch follows after this list.

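For a quick feel of the class in isolation, here is a minimal, hypothetical smoke test that bypasses the server entirely. It assumes the NLPModel class and imports above are in scope, uses gpt2 as a lightweight stand-in model, and passes None in place of the socket connection, since generate_response only forwards it.

# Hypothetical smoke test of NLPModel outside of any olympipe server.
# "gpt2" is only a lightweight stand-in; use_half is disabled so the test
# also runs on CPU-only machines.
model = NLPModel("gpt2", use_half=False)

# generate_response expects a (connection, data) tuple; None stands in for
# the socket because the method only passes it through untouched.
_, result = model.generate_response(
    (None, {"prompt": "Tell me a short joke.", "max_new_tokens": 20})
)
print(result["responses"])
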
Integration with Pipeline

if __name__ == "__main__":
    from torch.multiprocessing import set_start_method

    set_start_method("spawn")

    HF_MODEL_NAME = "Intel/neural-chat-7b-v3-2"
    Pipeline.server(
        [("POST", "/process", lambda x: x)],
        port=8000,
        host="localhost",
    ).class_task(NLPModel, NLPModel.generate_response, [HF_MODEL_NAME]).task(
        return_http_answer
    ).wait_for_completion()

The use of the Pipeline module is a strategic choice that greatly simplifies the request-handling process. Here's how it's integrated:

  • Pipeline Steps: A packet fed into the Pipeline goes through each step of the pipe (see olympipe for more examples), here class_task(NLPModel) followed by task(return_http_answer).
  • Defining the Route and Method: In our server setup, we define a specific route (e.g., /process) and associate it with a preprocessing function (in this case, lambda x: x). This ensures that when a POST request is made to the specified route, the request body arrives in the format expected by the pipeline's next task (here, class_task). A sketch of a more substantial preprocessing function follows after this list.

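As hinted above, the identity lambda can be swapped for something more substantial. The sketch below is a hypothetical preprocessing function that fills in default generation parameters; it assumes olympipe hands the route function the same (connection, data) packet that the next task consumes, so check the olympipe documentation before relying on that exact shape.

# Hypothetical replacement for lambda x: x on the /process route.
# Assumption: the route function receives and must return the same
# (connection, data) packet that class_task consumes downstream.
def with_defaults(packet):
    connection, data = packet
    data.setdefault("prompt", "")
    data.setdefault("max_new_tokens", 100)
    data.setdefault("num_return_sequences", 1)
    return connection, data

# It would then be registered in place of the identity lambda:
# Pipeline.server([("POST", "/process", with_defaults)], port=8000, host="localhost")
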
The return_http_answer Function

def return_http_answer(p: Tuple[socket, Dict[str, Any]]):
    connection, responses = p
    send_json_response(connection, responses)

The journey of a request concludes with the return_http_answer function, which plays a vital role in the server's HTTP response process:

  • Completing the HTTP Request: This function takes the output from the generate_response method and packages it into a format suitable for an HTTP response.
  • Sending the Response: It then sends the response back to the client, effectively closing the loop. This step is crucial, as it ensures that the client receives the generated NLP response in a timely and efficient manner. A defensive variant is sketched below.

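For a bit more resilience at this final stage, a hypothetical defensive variant could guard the send call. Only send_json_response is assumed here, used exactly as in the original function.

# Hypothetical defensive variant of return_http_answer: if sending the payload
# fails (for example because the client disconnected), log the error instead
# of letting the worker crash.
def return_http_answer_safe(p: Tuple[socket, Dict[str, Any]]):
    connection, responses = p
    try:
        send_json_response(connection, responses)
    except Exception as e:
        print(f"Failed to send response: {e}")
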
Advantages

  • Hugging Face Integration: The core of our server’s NLP capabilities lies in its integration with Hugging Face models, giving access to a vast repository of pre-trained models that are both state-of-the-art and remarkably easy to use.
  • Handling Multiple Requests: One of the significant challenges in setting up an NLP server is managing concurrent requests efficiently. Our server is designed to handle multiple requests simultaneously without compromising on performance.
  • Integration of Multiple Models: Another remarkable feature of our server is its ability to integrate multiple Hugging Face models into the same pipeline. This flexibility opens up a plethora of possibilities for elaborate text processing and analysis. Users can chain different models for complex workflows, such as first using a language detection model followed by a language-specific NLP model, all within the same pipeline (see the sketch after this list).
  • Two-Stage Request Handling: The server manages HTTP requests in two distinct stages — opening and closing. This bifurcation provides total control over the process that occurs between receiving a request and sending a response.
  • Faster Average Response: chaining the processing steps with olympipe yields the highest possible throughput, since almost no time is spent waiting for other concurrent threads to finish all their steps. Instead, each step is processed as soon as its input is available.

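To make the multi-model point concrete, here is a hypothetical sketch of chaining a second class into the same pipeline, reusing the class_task pattern from the server setup above. LanguageTagger and its tag method are placeholders for illustration, not part of olympipe; any class whose method accepts and returns the same (connection, data) packet would slot in the same way.

# Hypothetical two-stage pipeline: a tagging step feeding the generation step.
# LanguageTagger is an illustrative placeholder, not a real class; its tag
# method must keep the (connection, data) packet shape used by NLPModel.
Pipeline.server(
    [("POST", "/process", lambda x: x)],
    port=8000,
    host="localhost",
).class_task(
    LanguageTagger, LanguageTagger.tag, []
).class_task(
    NLPModel, NLPModel.generate_response, [HF_MODEL_NAME]
).task(
    return_http_answer
).wait_for_completion()
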
Usage Example

Setup:

poetry add transformers torch olympipe
# or
pip install transformers torch olympipe

Here is the full server code:

from socket import socket
from typing import Any, Dict, Tuple

import torch
import transformers
from olympipe import Pipeline
from olympipe.helpers.server import send_json_response


class NLPModel:
    def __init__(self, model_name: str, use_half: bool = True):
        # Prefer the GPU when one is available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Loading inference model {model_name} on {self.device}")
        if use_half:
            # Load weights in float16 to cut memory use and speed up GPU inference
            torch.set_default_dtype(torch.float16)
        self.model = transformers.AutoModelForCausalLM.from_pretrained(model_name).to(
            self.device
        )
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
        print(f"Model {model_name} loaded.")

    def generate_response(
        self, packet: Tuple[socket, Dict[str, Any]]
    ) -> Tuple[socket, Dict[str, Any]]:
        connection, data = packet

        prompt = data.get("prompt")
        max_new_tokens = data.get("max_new_tokens", 100)
        num_return_sequences = data.get("num_return_sequences", 1)

        # Tokenize the prompt and generate without tracking gradients
        with torch.no_grad():
            outputs = self.model.generate(
                self.tokenizer.encode(
                    prompt, return_tensors="pt", add_special_tokens=False
                ).to(self.device),
                max_new_tokens=max_new_tokens,
                num_return_sequences=num_return_sequences,
            )

        responses = [
            self.tokenizer.decode(output, skip_special_tokens=True)
            for output in outputs
        ]

        return connection, {"responses": responses}


def return_http_answer(p: Tuple[socket, Dict[str, Any]]):
    connection, responses = p
    send_json_response(connection, responses)


if __name__ == "__main__":
    from torch.multiprocessing import set_start_method

    set_start_method("spawn")

    HF_MODEL_NAME = "Intel/neural-chat-7b-v3-2"
    Pipeline.server(
        [("POST", "/process", lambda x: x)],
        port=8000,
        host="localhost",
    ).class_task(NLPModel, NLPModel.generate_response, [HF_MODEL_NAME]).task(
        return_http_answer
    ).wait_for_completion()

And the client code:

import multiprocessing
import random
from typing import Any

import requests

if __name__ == "__main__":

    def joke_request():
        adjectives = [
            "nice",
            "nasty",
            "ugly",
            "bad",
            "witty",
            "wicked",
            "correct",
            "deviant",
        ]

        adj1 = random.choice(adjectives)
        adj2 = random.choice(adjectives)

        data = {
            "prompt": f"You are a {adj1} chatbot, Tell me a {adj2} joke!",
            "max_new_tokens": 200,
        }
        return data

    def ask_joke(data: Any):
        try:
            response = requests.post("http://127.0.0.1:8000/process", json=data)

            if response.status_code == 200:
                result = response.json()
                print(f"Processed result: {result}")
                return result
            else:
                print(f"Error: {response.status_code}")
        except Exception as e:
            print(e)

    # Use a pool of 4 workers to send 40 joke requests concurrently
    out = []
    with multiprocessing.Pool(4) as p:
        res = p.starmap(ask_joke, [(joke_request(),) for _ in range(40)])
        out.append(res)

    print(out)

Conclusion

In this article, we’ve explored the intricacies and advantages of setting up an NLP inference server using the olympipe pipeline engine. From the simplicity of initializing HuggingFace models to the sophisticated handling of concurrent requests and the flexible integration of multiple models, this solution redefines efficiency in NLP server setups. The two-stage HTTP request management further ensures reliability and control, making the server not just powerful, but also remarkably user-friendly.

The implementation showcased here represents more than just a technical achievement; it’s a testament to the potential of NLP technology when combined with innovative software architecture. The reduction in total processing time for concurrent tasks, as illustrated by our pipeline model, underscores the importance of efficient design in handling complex computational tasks.

Try It Out: I encourage you to implement this solution in your projects. Experience firsthand the ease and efficiency it brings to NLP model deployment and management. Your thoughts and experiences are invaluable. 🙏 Please share your feedback, suggestions, or queries in the comments section. Let’s learn and grow together in this exciting field.

Thank you for reading, and I look forward to seeing the innovative ways you’ll leverage this solution in your NLP endeavors!
