Sample ASP.NET Core app with Azure Speech-to-Text

3 weeks ago

7 minutes read

This post walks through a sample ASP.NET Core application that uses Azure Speech-to-Text to convert an audio file to text. You can run it once you have configured an Azure Speech API key.

The project consists of:

Frontend: uploads the audio file
Backend (ASP.NET Core API): receives the audio file and sends it to the Azure Speech API for transcription
Instructions for configuring Azure Speech-to-Text

How to use:

Configure the Azure Speech API
- Sign up for the Azure Speech service
- Get the subscription key and region
- Replace YOUR_AZURE_SPEECH_KEY and YOUR_AZURE_REGION in the source code
Run the API
- Open a terminal in the project folder
- Run: dotnet run
Send a conversion request
- Use Postman or the frontend to POST the file to /api/speech/convert
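
As an alternative to Postman, the request in the last step can be sent from a small C# client. The sketch below is only an illustration: the form field name "file" and the base URL passed to ConvertAsync are assumptions and must match whatever the API project actually binds to and listens on.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class SpeechApiClient
{
    // Builds the multipart body for the upload.
    // Assumption: the API action reads the upload from a form field named "file".
    public static MultipartFormDataContent BuildUpload(byte[] audioBytes, string fileName)
    {
        var filePart = new ByteArrayContent(audioBytes);
        filePart.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");

        var form = new MultipartFormDataContent();
        form.Add(filePart, "file", fileName);
        return form;
    }

    // Posts the audio file to /api/speech/convert and returns the raw response body.
    // baseUrl is hypothetical, e.g. the address printed by "dotnet run".
    public static async Task<string> ConvertAsync(HttpClient client, string baseUrl, string audioPath)
    {
        byte[] bytes = await File.ReadAllBytesAsync(audioPath);
        using var form = BuildUpload(bytes, Path.GetFileName(audioPath));

        var endpoint = new Uri(new Uri(baseUrl), "/api/speech/convert");
        using var response = await client.PostAsync(endpoint, form);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```

Keeping BuildUpload separate from the network call makes it easy to check the request shape without a running server.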

Azure Speech-to-Text is an Azure AI service that converts speech to text and supports both real-time transcription and batch processing. To use it, configure it as follows:

1. Create a Speech resource in the Azure Portal:

Sign in to the Azure Portal.
Search for and select “Speech Services” or “Cognitive Services”.
Click “Create” and fill in the required details: subscription, resource group, region, and pricing tier.
Once the resource is deployed, open it to get the keys and endpoints needed to connect. (toolify.ai)

2. Integrate the service into your application:

Use the SDKs provided by Azure to add speech-to-text to your application. Azure supports many programming languages, including C#, C++, Java, and others. (azure.microsoft.com)
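
For a dependency-free illustration, the sketch below calls the Speech service's REST endpoint for short audio with HttpClient instead of the SDK; this is a swapped-in technique, not the SDK integration the paragraph describes. The class and method names are hypothetical, and the request assumes 16 kHz PCM WAV input; the short-audio endpoint is intended for brief clips, so longer recordings need the SDK or batch transcription.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text.Json;

public static class AzureSpeechRest
{
    // Builds a recognition request for the short-audio REST endpoint.
    // region and key come from the Speech resource in the Azure Portal.
    public static HttpRequestMessage BuildRequest(string region, string key, byte[] wavBytes, string language = "en-US")
    {
        string uri = $"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1" +
                     $"?language={Uri.EscapeDataString(language)}";

        var request = new HttpRequestMessage(HttpMethod.Post, uri);
        request.Headers.Add("Ocp-Apim-Subscription-Key", key);
        request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));

        var body = new ByteArrayContent(wavBytes);
        // Assumption: the audio is 16 kHz PCM in a WAV container.
        body.Headers.TryAddWithoutValidation("Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000");
        request.Content = body;
        return request;
    }

    // Returns DisplayText when RecognitionStatus is "Success", otherwise an empty string.
    public static string ExtractDisplayText(string json)
    {
        using var doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;
        if (root.TryGetProperty("RecognitionStatus", out var status) &&
            status.GetString() == "Success" &&
            root.TryGetProperty("DisplayText", out var text))
        {
            return text.GetString() ?? string.Empty;
        }
        return string.Empty;
    }
}
```

Send the request with `await client.SendAsync(request)` and pass the response body to ExtractDisplayText.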

Pricing:
Azure Speech-to-Text is billed by usage, with no upfront cost. For speech-to-text and speech translation you pay per hour of audio transcribed or translated. For detailed prices, see the Azure AI Speech pricing page. (azure.microsoft.com)

Note:
To avoid requests being throttled or rejected for exceeding quota, review the Speech service quotas and limits in the Azure documentation. (learn.microsoft.com)
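
When a request is throttled, a common client-side response is to wait and retry. The sketch below is one way to do that, assuming the service signals throttling with HTTP 429 and optionally a Retry-After header; the class name, attempt count, and base delay are illustrative choices, not values taken from the Azure documentation.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class ThrottleRetry
{
    // Picks the wait before the next attempt: the server's Retry-After hint if present,
    // otherwise exponential backoff (baseDelay, 2x, 4x, ...). attempt starts at 1.
    public static TimeSpan ChooseDelay(int attempt, TimeSpan baseDelay, TimeSpan? retryAfter)
    {
        if (retryAfter.HasValue) return retryAfter.Value;
        return TimeSpan.FromTicks(baseDelay.Ticks << (attempt - 1));
    }

    // Sends a request, retrying only on HTTP 429. A factory is used because
    // an HttpRequestMessage cannot be sent twice.
    public static async Task<HttpResponseMessage> SendWithRetryAsync(
        HttpClient client, Func<HttpRequestMessage> makeRequest, int maxAttempts, TimeSpan baseDelay)
    {
        for (int attempt = 1; ; attempt++)
        {
            HttpResponseMessage response = await client.SendAsync(makeRequest());
            bool throttled = response.StatusCode == HttpStatusCode.TooManyRequests;
            if (!throttled || attempt >= maxAttempts) return response;

            TimeSpan wait = ChooseDelay(attempt, baseDelay, response.Headers.RetryAfter?.Delta);
            response.Dispose();
            await Task.Delay(wait);
        }
    }
}
```

The caller still has to handle the final response, which may be a 429 if every attempt was throttled.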

To get started quickly, you can try speech-to-text directly in Speech Studio without writing any code. (learn.microsoft.com)

Several free or partly free speech-to-text APIs can be used in place of Azure Speech-to-Text. Here are some common options:

1. Vosk API (offline, free and open source)

Pros: works offline, no API usage fees, supports many languages.
Cons: accuracy may be lower than commercial AI services.
How to use: install the Vosk library and call it from your application.
Quick guide (C#/.NET calling a Python wrapper script):

using System;
using System.Diagnostics;
using System.IO;

ProcessStartInfo start = new ProcessStartInfo();
start.FileName = "python";
start.Arguments = "vosk_transcribe.py your_audio_file.wav";
start.UseShellExecute = false;
start.RedirectStandardOutput = true;

// Run the script and print whatever it writes to standard output.
using (Process process = Process.Start(start))
{
    using (StreamReader reader = process.StandardOutput)
    {
        string result = reader.ReadToEnd();
        Console.WriteLine(result);
    }
}

👉 Vosk documentation

2. Google Speech-to-Text (limited free tier)

Pros: high accuracy, supports many languages, has a free tier.
Cons: requires a Google Cloud account; 60 minutes/month are free, after which usage is billed.
How to use:
- Sign up for a Google Cloud account.
- Enable the Google Speech-to-Text API.
- Get an API key and call the API.
Quick call over HTTP:

POST https://speech.googleapis.com/v1/speech:recognize?key=YOUR_API_KEY
Content-Type: application/json

{
  "config": {
    "encoding": "LINEAR16",
    "sampleRateHertz": 16000,
    "languageCode": "en-US"
  },
  "audio": {
    "uri": "gs://your-bucket/audio.wav"
  }
}

👉 Google Speech-to-Text guide
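
The same request can be assembled in C#. The sketch below builds the JSON body shown above and pulls the transcript out of a response; the class and method names are hypothetical, and the response handling assumes the usual results → alternatives → transcript layout.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

public static class GoogleSpeechRequest
{
    // Serializes the request body for speech:recognize, matching the JSON above.
    public static string BuildBody(string gcsUri, string languageCode = "en-US", int sampleRateHertz = 16000)
    {
        var body = new
        {
            config = new { encoding = "LINEAR16", sampleRateHertz, languageCode },
            audio = new { uri = gcsUri }
        };
        return JsonSerializer.Serialize(body);
    }

    // Joins the top alternative of every result into one transcript string.
    public static string ExtractTranscript(string responseJson)
    {
        using var doc = JsonDocument.Parse(responseJson);
        var parts = new List<string>();
        if (doc.RootElement.TryGetProperty("results", out var results))
        {
            foreach (JsonElement result in results.EnumerateArray())
            {
                if (result.TryGetProperty("alternatives", out var alternatives) &&
                    alternatives.GetArrayLength() > 0 &&
                    alternatives[0].TryGetProperty("transcript", out var transcript))
                {
                    parts.Add((transcript.GetString() ?? string.Empty).Trim());
                }
            }
        }
        return string.Join(" ", parts);
    }
}
```

Post the body with `new StringContent(body, Encoding.UTF8, "application/json")` to the URL shown above.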

3. OpenAI Whisper API (free to try, account required)

Pros: strong AI model, supports many languages, free to try.
Cons: requires an OpenAI account, the free allowance is limited, needs a network connection.
How to use:
- Sign up for an OpenAI account.
- Call the API from C#:

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class Program
{
    static async Task Main(string[] args)
    {
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Authorization", "Bearer YOUR_OPENAI_API_KEY");

        // Multipart form: the model name plus the audio file.
        var content = new MultipartFormDataContent
        {
            { new StringContent("whisper-1"), "model" },
            { new ByteArrayContent(File.ReadAllBytes("your_audio_file.mp3")), "file", "audio.mp3" }
        };

        var response = await client.PostAsync("https://api.openai.com/v1/audio/transcriptions", content);
        var responseString = await response.Content.ReadAsStringAsync();
        Console.WriteLine(responseString); // JSON containing a "text" field
    }
}

👉 OpenAI Whisper API guide

4. IBM Watson Speech-to-Text (500 free minutes/month)

Pros: 500 free minutes per month, supports many audio formats.
Cons: requires an IBM Cloud account.
How to use:
- Sign up for IBM Cloud.
- Create a Speech-to-Text service instance.
- Call the API over HTTP. Basic authentication uses the literal username apikey with the API key as the password, Base64-encoded together:

POST https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID/v1/recognize
Authorization: Basic BASE64("apikey:YOUR_API_KEY")
Content-Type: audio/wav

👉 IBM Watson Speech-to-Text
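
The Authorization header for this call has to be Base64-encoded rather than pasted as a raw key. A minimal sketch of building the request in C#, assuming Basic authentication with the username apikey and the API key as the password; the class and method names are hypothetical.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

public static class IbmWatsonRequest
{
    // Builds "Basic base64(apikey:<key>)" for the Authorization header.
    public static AuthenticationHeaderValue BasicFromApiKey(string apiKey)
    {
        byte[] raw = Encoding.UTF8.GetBytes("apikey:" + apiKey);
        return new AuthenticationHeaderValue("Basic", Convert.ToBase64String(raw));
    }

    // serviceUrl is the instance URL, e.g.
    // https://api.us-south.speech-to-text.watson.cloud.ibm.com/instances/YOUR_INSTANCE_ID
    public static HttpRequestMessage BuildRecognize(string serviceUrl, string apiKey, byte[] wavBytes)
    {
        var request = new HttpRequestMessage(HttpMethod.Post, serviceUrl.TrimEnd('/') + "/v1/recognize");
        request.Headers.Authorization = BasicFromApiKey(apiKey);

        var body = new ByteArrayContent(wavBytes);
        body.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");
        request.Content = body;
        return request;
    }
}
```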

Summary comparison of the free APIs

API	Free allowance	Pros	Cons
Vosk	Unlimited	Free, offline	Requires installation, average accuracy
Google Speech-to-Text	60 minutes/month	High accuracy	Requires a Google Cloud account
OpenAI Whisper	Free to try	Strong AI, multilingual	Limited free API usage
IBM Watson	500 minutes/month	Supports many formats	Requires an IBM Cloud account

If you want to avoid fees and need to run offline, Vosk is the best choice.
If you need a stronger model, OpenAI Whisper or Google Speech-to-Text is a reasonable choice.

To transcribe a YouTube video, you can:

Download the video from YouTube (keeping only the audio).
Convert the audio to text with the OpenAI Whisper API (or another service).

Here is how to do it in C#:

1. Install the required libraries

Open a terminal in the project folder and run:

dotnet add package VideoLibrary
dotnet add package NAudio
dotnet add package Xabe.FFmpeg

🔹 VideoLibrary – downloads video from YouTube.
🔹 NAudio – audio processing (optional; the code below does not call it).
🔹 Xabe.FFmpeg – converts audio formats (a wrapper that needs the FFmpeg executables installed).

2. Code to download audio from YouTube and transcribe it

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using VideoLibrary;
using Xabe.FFmpeg;

class Program
{
    static async Task Main(string[] args)
    {
        string videoUrl = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID";
        string audioPath = await DownloadAudioFromYouTube(videoUrl);
        string transcription = await TranscribeAudio(audioPath);
        Console.WriteLine("Transcription:\n" + transcription);
    }

    static async Task<string> DownloadAudioFromYouTube(string url)
    {
        Console.WriteLine("Downloading audio...");
        var youtube = YouTube.Default;
        var video = await youtube.GetVideoAsync(url);
        string videoPath = Path.Combine(Directory.GetCurrentDirectory(), video.FullName);
        await File.WriteAllBytesAsync(videoPath, await video.GetBytesAsync());

        string audioPath = Path.ChangeExtension(videoPath, ".mp3");
        await ConvertToMp3(videoPath, audioPath);
        File.Delete(videoPath); // Delete the video once it has been converted
        return audioPath;
    }

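    static async Task ConvertToMp3(string inputPath, string outputPath)
    {
        Console.WriteLine("Converting to MP3...");
        // Xabe.FFmpeg needs the ffmpeg executables on PATH,
        // or a directory set beforehand with FFmpeg.SetExecutablesPath(...).
        await FFmpeg.Conversions.New()
            .AddParameter($"-i \"{inputPath}\" -q:a 0 -map a \"{outputPath}\"")
            .Start();
    }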

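    static async Task<string> TranscribeAudio(string audioPath)
    {
        Console.WriteLine("Transcribing audio...");
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Authorization", "Bearer YOUR_OPENAI_API_KEY");

        var content = new MultipartFormDataContent
        {
            { new StringContent("whisper-1"), "model" },
            { new ByteArrayContent(await File.ReadAllBytesAsync(audioPath)), "file", "audio.mp3" }
        };

        var response = await client.PostAsync("https://api.openai.com/v1/audio/transcriptions", content);
        response.EnsureSuccessStatusCode(); // Fail fast on an invalid key, throttling, etc.

        // The API returns JSON such as {"text": "..."}; return only the transcript.
        string json = await response.Content.ReadAsStringAsync();
        using var doc = System.Text.Json.JsonDocument.Parse(json);
        return doc.RootElement.GetProperty("text").GetString();
    }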
}

3. How to use

Replace "YOUR_VIDEO_ID" with the ID of the YouTube video to download.
Replace "YOUR_OPENAI_API_KEY" with your OpenAI API key.
Run the program. It will:
- Download the video from YouTube.
- Extract the audio and convert it to .mp3.
- Send the file to the OpenAI Whisper API and receive the text.

4. Result

After the run, you get the text transcribed from the audio of the YouTube video.
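
Long videos can produce an MP3 that is too large to upload in one request. The sketch below adds a size check before the upload; it assumes a 25 MB per-file limit on the transcription endpoint (verify this against the current API documentation), and the class and method names are hypothetical.

```csharp
using System;
using System.IO;

public static class UploadGuard
{
    // Assumption: the transcription endpoint rejects files larger than 25 MB.
    public const long MaxUploadBytes = 25L * 1024 * 1024;

    // True when a file of this size can be sent in a single request.
    public static bool FitsUploadLimit(long fileSizeBytes)
    {
        return fileSizeBytes > 0 && fileSizeBytes <= MaxUploadBytes;
    }

    // Throws before any network call if the file is empty or too large.
    public static void EnsureUploadable(string audioPath)
    {
        long size = new FileInfo(audioPath).Length;
        if (!FitsUploadLimit(size))
        {
            throw new InvalidOperationException(
                $"{Path.GetFileName(audioPath)} is {size} bytes; re-encode at a lower bitrate or split it first.");
        }
    }
}
```

Calling `UploadGuard.EnsureUploadable(audioPath)` just before the transcription step turns an opaque API error into a clear local one.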

This idea amounts to creating a storyboard from the audio content of a video. Here is how to do it:

📌 Pipeline:

Download the video from YouTube and extract the audio
Convert the audio to text (Speech-to-Text)
Split the text into short descriptive segments
Use AI to generate an image for each segment (DALL·E)
Export the images with their text as a storyboard

🚀 C# code for the pipeline

You need the OpenAI API (Whisper + DALL·E) and FFmpeg to process the audio. The code below covers the first four steps: it saves one image per text segment and prints its path.

1️⃣ Install the required libraries

Run the following in a terminal in the project folder:

dotnet add package VideoLibrary
dotnet add package Xabe.FFmpeg
dotnet add package Newtonsoft.Json
dotnet add package System.Drawing.Common

2️⃣ Complete C# code

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using System.Collections.Generic;
using Newtonsoft.Json;
using VideoLibrary;
using Xabe.FFmpeg;

class Program
{
    static async Task Main(string[] args)
    {
        string videoUrl = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID";
        string apiKey = "YOUR_OPENAI_API_KEY";

        string audioPath = await DownloadAudioFromYouTube(videoUrl);
        string text = await TranscribeAudio(audioPath, apiKey);
        List<string> textSegments = SplitText(text, 5); // Split into segments of 5 sentences

        Console.WriteLine("Generating storyboard...");
        foreach (var segment in textSegments)
        {
            string imagePath = await GenerateImageFromText(segment, apiKey);
            Console.WriteLine($"📷 Image generated: {imagePath}");
        }

        Console.WriteLine("✅ Storyboard completed!");
    }

    static async Task<string> DownloadAudioFromYouTube(string url)
    {
        Console.WriteLine("Downloading audio...");
        var youtube = YouTube.Default;
        var video = await youtube.GetVideoAsync(url);
        string videoPath = Path.Combine(Directory.GetCurrentDirectory(), video.FullName);
        await File.WriteAllBytesAsync(videoPath, await video.GetBytesAsync());

        string audioPath = Path.ChangeExtension(videoPath, ".mp3");
        await ConvertToMp3(videoPath, audioPath);
        File.Delete(videoPath); 
        return audioPath;
    }

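    static async Task ConvertToMp3(string inputPath, string outputPath)
    {
        Console.WriteLine("Converting to MP3...");
        // Xabe.FFmpeg needs the ffmpeg executables on PATH,
        // or a directory set beforehand with FFmpeg.SetExecutablesPath(...).
        await FFmpeg.Conversions.New()
            .AddParameter($"-i \"{inputPath}\" -q:a 0 -map a \"{outputPath}\"")
            .Start();
    }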

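    static async Task<string> TranscribeAudio(string audioPath, string apiKey)
    {
        Console.WriteLine("Transcribing audio...");
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Authorization", "Bearer " + apiKey);

        var content = new MultipartFormDataContent
        {
            { new StringContent("whisper-1"), "model" },
            { new ByteArrayContent(await File.ReadAllBytesAsync(audioPath)), "file", "audio.mp3" }
        };

        var response = await client.PostAsync("https://api.openai.com/v1/audio/transcriptions", content);
        response.EnsureSuccessStatusCode(); // Fail fast on an invalid key, throttling, etc.

        // The API returns JSON such as {"text": "..."}; return only the transcript,
        // otherwise SplitText would split the raw JSON.
        dynamic result = JsonConvert.DeserializeObject(await response.Content.ReadAsStringAsync());
        return (string)result.text;
    }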

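    static List<string> SplitText(string text, int sentencesPerSegment)
    {
        List<string> segments = new List<string>();
        List<string> sentences = new List<string>();
        foreach (var part in text.Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string sentence = part.Trim(); // Drop the whitespace left around each sentence
            if (sentence.Length > 0) sentences.Add(sentence);
        }

        // Group the sentences into segments of at most sentencesPerSegment each.
        for (int i = 0; i < sentences.Count; i += sentencesPerSegment)
        {
            int count = Math.Min(sentencesPerSegment, sentences.Count - i);
            segments.Add(string.Join(". ", sentences.GetRange(i, count)) + ".");
        }
        return segments;
    }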

    static async Task<string> GenerateImageFromText(string prompt, string apiKey)
    {
        Console.WriteLine($"Generating image for: {prompt}");
        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add("Authorization", "Bearer " + apiKey);

        var requestBody = new { prompt = prompt, n = 1, size = "1024x1024" };
        var content = new StringContent(JsonConvert.SerializeObject(requestBody), System.Text.Encoding.UTF8, "application/json");

        var response = await client.PostAsync("https://api.openai.com/v1/images/generations", content);
        response.EnsureSuccessStatusCode(); // An error body has no "data" array to read below
        var jsonResponse = await response.Content.ReadAsStringAsync();
        dynamic result = JsonConvert.DeserializeObject(jsonResponse);

        string imageUrl = result.data[0].url;
        string imagePath = Path.Combine(Directory.GetCurrentDirectory(), "image_" + Guid.NewGuid() + ".png");

        using (var imageClient = new HttpClient())
        {
            var imageBytes = await imageClient.GetByteArrayAsync(imageUrl);
            await File.WriteAllBytesAsync(imagePath, imageBytes);
        }

        return imagePath;
    }
}
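
The program above stops after saving the images, so the last pipeline step (exporting images together with their text) still has to be added. One possible sketch writes a single HTML page with one panel per segment. The class name, the markup, and the output file name are all assumptions; it expects the image files to sit in the same folder as the HTML file, which matches where GenerateImageFromText saves them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text;

public static class StoryboardWriter
{
    // Builds an HTML page with one <figure> per (text, image) pair.
    public static string BuildHtml(IReadOnlyList<(string Text, string ImagePath)> panels)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<!DOCTYPE html>");
        sb.AppendLine("<html><head><meta charset=\"utf-8\"><title>Storyboard</title></head><body>");
        for (int i = 0; i < panels.Count; i++)
        {
            // Encode both values so transcript text cannot break the markup.
            string src = WebUtility.HtmlEncode(Path.GetFileName(panels[i].ImagePath));
            string caption = WebUtility.HtmlEncode(panels[i].Text);
            sb.AppendLine($"<figure><img src=\"{src}\" alt=\"Panel {i + 1}\" width=\"512\">" +
                          $"<figcaption>{caption}</figcaption></figure>");
        }
        sb.AppendLine("</body></html>");
        return sb.ToString();
    }

    // Writes the page next to the images; "storyboard.html" is a hypothetical name.
    public static string Save(IReadOnlyList<(string Text, string ImagePath)> panels, string directory)
    {
        string path = Path.Combine(directory, "storyboard.html");
        File.WriteAllText(path, BuildHtml(panels), Encoding.UTF8);
        return path;
    }
}
```

To use it, collect each `(segment, imagePath)` pair in a list inside the foreach loop in Main and call `StoryboardWriter.Save(panels, Directory.GetCurrentDirectory())` after the loop.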