A Guide to OpenAI Text-to-Speech (TTS) in Spring AI

1. Introduction

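Today, applications benefit greatly from neural network integrations such as knowledge bases, assistants, or analytics engines. One practical use case is converting text to speech. This process, known as text-to-speech (TTS), enables automated audio content creation with natural, human-like voices.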

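Modern TTS systems use deep learning to handle pronunciation, rhythm, intonation, and even emotion. Unlike earlier rule-based approaches, these models are trained on large datasets and can generate expressive, multilingual speech, which makes them well suited to global applications such as virtual assistants or inclusive educational platforms.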

In this tutorial, we'll explore how to use OpenAI text-to-speech through its Spring AI integration.

2. Dependencies and Configuration

We'll start by adding the spring-ai-starter-model-openai dependency:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>1.1.0</version>
</dependency>

Next, we'll configure the Spring AI properties for the OpenAI model:

spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.options.model=tts-1
spring.ai.openai.audio.speech.options.voice=alloy
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0

To use the OpenAI API, we must set the OpenAI API key. We also specify the name of the text-to-speech model, the voice, the response format, and the audio speed.
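For orientation, these properties correspond to fields of the JSON body accepted by OpenAI's speech endpoint (POST /v1/audio/speech). The sketch below builds such a request with the JDK's HttpRequest type without sending it. It only illustrates the mapping and is not how Spring AI is implemented; the class and method names (SpeechRequestSketch, jsonBody, build) and the placeholder key are assumptions made for this example.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Locale;

// Hypothetical helper: shows how the configured options map onto a raw API request.
class SpeechRequestSketch {

    // Builds the JSON body by hand; the input text is assumed to need no JSON escaping.
    static String jsonBody(String model, String input, String voice, String format, double speed) {
        return String.format(Locale.ROOT,
          "{\"model\":\"%s\",\"input\":\"%s\",\"voice\":\"%s\",\"response_format\":\"%s\",\"speed\":%.1f}",
          model, input, voice, format, speed);
    }

    // Creates the POST request object; nothing is sent over the network here.
    static HttpRequest build(String apiKey, String body) {
        return HttpRequest.newBuilder(URI.create("https://api.openai.com/v1/audio/speech"))
          .header("Authorization", "Bearer " + apiKey)
          .header("Content-Type", "application/json")
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build();
    }

    public static void main(String[] args) {
        String body = jsonBody("tts-1", "Hello from Baeldung", "alloy", "mp3", 1.0);
        HttpRequest request = build("placeholder-key", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

With the Spring AI starter in place, we never write this request ourselves; the properties alone are enough.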

3. Building the Text-to-Speech Application

Now, we'll build our text-to-speech application. First, we'll create the TextToSpeechService:

@Service
public class TextToSpeechService {

    private final OpenAiAudioSpeechModel openAiAudioSpeechModel;

    @Autowired
    public TextToSpeechService(OpenAiAudioSpeechModel openAiAudioSpeechModel) {
        this.openAiAudioSpeechModel = openAiAudioSpeechModel;
    }

    public byte[] makeSpeech(String text) {

        SpeechPrompt speechPrompt = new SpeechPrompt(text);

        SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);
        return response.getResult().getOutput();
    }
}

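Here, we use the OpenAiAudioSpeechModel, which Spring AI preconfigures from our properties. We also define the makeSpeech() method, which uses OpenAiAudioSpeechModel to convert text into the bytes of an audio file.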

Next, we'll create the TextToSpeechController:

@RestController
public class TextToSpeechController {
    private final TextToSpeechService textToSpeechService;

    @Autowired
    public TextToSpeechController(TextToSpeechService textToSpeechService) {
        this.textToSpeechService = textToSpeechService;
    }

    @GetMapping("/text-to-speech")
    public ResponseEntity<byte[]> generateSpeechForText(@RequestParam String text) {
        return ResponseEntity.ok(textToSpeechService.makeSpeech(text));
    }
}

Finally, let's test our endpoint:

@SpringBootTest
@ExtendWith(SpringExtension.class)
@AutoConfigureMockMvc
@EnabledIfEnvironmentVariable(named = "OPENAI_API_KEY", matches = ".*")
class TextToSpeechLiveTest {

    @Autowired
    private MockMvc mockMvc;

    @Autowired
    private TextToSpeechService textToSpeechService;

    @Test
    void givenTextToSpeechService_whenCallingTextToSpeechEndpoint_thenExpectedAudioFileBytesShouldBeObtained() throws Exception {
        byte[] audioContent = mockMvc.perform(get("/text-to-speech")
          .param("text", "Hello from Baeldung"))
          .andExpect(status().isOk())
          .andReturn()
          .getResponse()
          .getContentAsByteArray();

        assertNotEquals(0, audioContent.length);
    }
}

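We call the text-to-speech endpoint and verify the response code and that the content is non-empty. If we save the content to a file, we get an MP3 file containing our speech.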
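As a minimal sketch of that last step, the helper below writes a byte array to a file and runs a rough check that the data looks like MP3 (an ID3v2 tag or an MPEG frame sync at the start). The class AudioFileSketch and its methods are hypothetical names for this example, and the check is a heuristic rather than a full format validation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for inspecting and saving the bytes returned by the endpoint.
class AudioFileSketch {

    // Heuristic: MP3 data starts either with an "ID3" tag or with an MPEG frame sync (11 set bits).
    static boolean looksLikeMp3(byte[] data) {
        if (data == null || data.length < 3) {
            return false;
        }
        boolean id3Tag = data[0] == 'I' && data[1] == 'D' && data[2] == '3';
        boolean frameSync = (data[0] & 0xFF) == 0xFF && (data[1] & 0xE0) == 0xE0;
        return id3Tag || frameSync;
    }

    // Writes the bytes to the given path, replacing any existing content.
    static Path save(byte[] audio, Path target) throws IOException {
        return Files.write(target, audio);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in bytes; in the live test we would pass audioContent instead.
        byte[] fake = {'I', 'D', '3', 4, 0, 0};
        Path file = save(fake, Files.createTempFile("speech", ".mp3"));
        System.out.println(file + " looks like MP3: " + looksLikeMp3(Files.readAllBytes(file)));
    }
}
```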

4. Adding a Real-Time Audio Endpoint

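We may face significant memory consumption when we fetch large audio content all at once as one huge byte array. In addition, we sometimes want to start playing the audio before it has been fully transferred. For this purpose, OpenAI supports streaming text-to-speech responses.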

Let's extend our TextToSpeechService to support this:

public Flux<byte[]> makeSpeechStream(String text) {
    SpeechPrompt speechPrompt = new SpeechPrompt(text);
    Flux<SpeechResponse> responseStream = openAiAudioSpeechModel.stream(speechPrompt);

    return responseStream
      .map(SpeechResponse::getResult)
      .map(Speech::getOutput);
}

We've added the makeSpeechStream() method. Here, we use the stream() method of OpenAiAudioSpeechModel to produce the data as a stream of byte chunks.

Next, we'll create an endpoint that streams these bytes over HTTP:

@GetMapping(value = "/text-to-speech-stream", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
public ResponseEntity<StreamingResponseBody> streamSpeech(@RequestParam("text") String text) {
    Flux<byte[]> audioStream = textToSpeechService.makeSpeechStream(text);

    StreamingResponseBody responseBody = outputStream -> {
        audioStream.toStream().forEach(bytes -> {
            try {
                outputStream.write(bytes);
                outputStream.flush();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    };

    return ResponseEntity.ok()
      .contentType(MediaType.APPLICATION_OCTET_STREAM)
      .body(responseBody);
}

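Here, we iterate over the byte stream and write each chunk to the StreamingResponseBody. If we were using WebFlux, we'd return the Flux directly from the endpoint. Optionally, we can use the application/octet-stream content type to indicate that the response is a stream.

Now let's test our streaming method: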
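To look at the write-and-flush loop in isolation, without Spring or a network connection, here is a small sketch that pushes chunks from a java.util.stream.Stream into any OutputStream. ChunkWriterSketch and writeChunks are hypothetical names; the in-memory sink stands in for the servlet response stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.stream.Stream;

// Hypothetical stand-alone version of the loop inside the StreamingResponseBody.
class ChunkWriterSketch {

    // Writes each chunk as it arrives, flushes it, and returns the total number of bytes written.
    static long writeChunks(Stream<byte[]> chunks, OutputStream out) {
        long[] total = {0};
        chunks.forEach(bytes -> {
            try {
                out.write(bytes);
                out.flush(); // push the chunk to the client instead of buffering it
                total[0] += bytes.length;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        return total[0];
    }

    public static void main(String[] args) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long written = writeChunks(Stream.of(new byte[] {1, 2}, new byte[] {3}), sink);
        System.out.println(written + " bytes written");
    }
}
```

Flushing after every chunk trades a little throughput for lower latency, which is what lets a client start playback early.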

@Test
void givenStreamingEndpoint_whenCalled_thenReceiveAudioFileBytes() throws Exception {

    String longText = """
          Hello from Baeldung!
          Here, we explore the world of Java,
          Spring, and web development with clear, practical tutorials.
          Whether you're just starting out or diving deep into advanced
          topics, you'll find guides to help you write clean, efficient,
          and modern code.
          """;

    mockMvc.perform(get("/text-to-speech-stream")
        .param("text", longText)
        .accept(MediaType.APPLICATION_OCTET_STREAM))
      .andExpect(status().isOk())
      .andDo(result -> {
          byte[] response = result.getResponse().getContentAsByteArray();
          assertNotNull(response);
          assertTrue(response.length > 0);
      });
}

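We call our streaming endpoint and verify that it returns a non-empty byte array. MockMvc collects the full response body, but we could also read it as a stream.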
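Reading the body as a stream could look roughly like the sketch below, which copies an InputStream through a small fixed-size buffer so that memory use stays constant regardless of the audio length. A real client could obtain such a stream from the JDK HttpClient via HttpResponse.BodyHandlers.ofInputStream(); here an in-memory stream stands in for the response body, and StreamReaderSketch and copyInChunks are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical client-side reader that never holds the whole response in memory.
class StreamReaderSketch {

    // Copies the input to the sink buffer by buffer and returns the number of bytes read.
    static long copyInChunks(InputStream in, OutputStream sink, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int read;
            while ((read = in.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
                total += read;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in for a streamed HTTP response body.
        InputStream fakeBody = new ByteArrayInputStream(new byte[10_000]);
        long total = copyInChunks(fakeBody, OutputStream.nullOutputStream(), 4096);
        System.out.println("Read " + total + " bytes");
    }
}
```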

5. Customizing Model Parameters for a Specific Call

Sometimes we need to override the model options for a specific call. For this, we can use OpenAiAudioSpeechOptions. Let's update our TextToSpeechService to support custom speech options:

public byte[] makeSpeech(String text, OpenAiAudioSpeechOptions speechOptions) {
    SpeechPrompt speechPrompt = new SpeechPrompt(text, speechOptions);

    SpeechResponse response = openAiAudioSpeechModel.call(speechPrompt);

    return response.getResult().getOutput();
}

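We've overloaded the makeSpeech() method, adding an OpenAiAudioSpeechOptions parameter that we pass along with the prompt in the call to the OpenAI API. If we pass an empty options object, the default options are used.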
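The idea that per-call options take precedence and that anything left unset falls back to the configured defaults can be sketched in plain Java. The SpeechOpts record below is a hypothetical stand-in for OpenAiAudioSpeechOptions, and merge() illustrates the concept only; it is not Spring AI's actual merging code.

```java
// Hypothetical stand-in for speech options; null means "not set for this call".
record SpeechOpts(String model, String voice, Double speed) {

    // Returns options in which every non-null per-call value wins over the default.
    static SpeechOpts merge(SpeechOpts defaults, SpeechOpts perCall) {
        return new SpeechOpts(
          perCall.model() != null ? perCall.model() : defaults.model(),
          perCall.voice() != null ? perCall.voice() : defaults.voice(),
          perCall.speed() != null ? perCall.speed() : defaults.speed());
    }
}

class OptionMergeSketch {
    public static void main(String[] args) {
        SpeechOpts defaults = new SpeechOpts("tts-1", "alloy", 1.0);
        SpeechOpts onlyVoice = new SpeechOpts(null, "nova", null);
        // Only the voice changes; model and speed come from the defaults.
        System.out.println(SpeechOpts.merge(defaults, onlyVoice));
    }
}
```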

Now, we'll create a new endpoint that accepts the speech parameters:

@GetMapping("/text-to-speech-customized")
public ResponseEntity<byte[]> generateSpeechForTextCustomized(@RequestParam("text") String text, @RequestParam Map<String, String> params) {
    OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
      .model(params.get("model"))
      .voice(OpenAiAudioApi.SpeechRequest.Voice.valueOf(params.get("voice")))
      .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.valueOf(params.get("responseFormat")))
      .speed(Float.parseFloat(params.get("speed")))
      .build();

    return ResponseEntity.ok(textToSpeechService.makeSpeech(text, speechOptions));
}

Here, we receive the speech parameters as a map and build the OpenAiAudioSpeechOptions from it.
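Note that valueOf() and Float.parseFloat() throw when a parameter is missing or malformed (for example, a lowercase "nova"). One possible way to make the parsing more forgiving is sketched below. The Voice enum is a hypothetical stand-in for OpenAiAudioApi.SpeechRequest.Voice, the helper names are made up for this example, and the clamping range of 0.25 to 4.0 reflects the speed limits documented for the OpenAI speech API.

```java
import java.util.Locale;

// Hypothetical helpers for lenient request-parameter parsing.
class SpeechParamsSketch {

    // Stand-in for OpenAiAudioApi.SpeechRequest.Voice.
    enum Voice { ALLOY, ECHO, FABLE, ONYX, NOVA, SHIMMER }

    // Case-insensitive enum lookup that returns the fallback instead of throwing.
    static <E extends Enum<E>> E parseEnum(Class<E> type, String raw, E fallback) {
        if (raw == null || raw.isBlank()) {
            return fallback;
        }
        try {
            return Enum.valueOf(type, raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    // Parses the speed and clamps it to the 0.25..4.0 range; invalid input yields the fallback.
    static float parseSpeed(String raw, float fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            float speed = Float.parseFloat(raw.trim());
            if (Float.isNaN(speed)) {
                return fallback;
            }
            return Math.max(0.25f, Math.min(4.0f, speed));
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseEnum(Voice.class, "nova", Voice.ALLOY));
        System.out.println(parseSpeed("9.5", 1.0f));
    }
}
```

Whether to fall back silently or to reject the request with a 400 response is a design choice; the sketch takes the lenient route.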

Finally, let's test the new endpoint:

@Test
void givenTextToSpeechService_whenCallingTextToSpeechEndpointWithAnotherVoiceOption_thenExpectedAudioFileBytesShouldBeObtained() throws Exception {
    byte[] audioContent = mockMvc.perform(get("/text-to-speech-customized")
      .param("text", "Hello from Baeldung")
      .param("model", "tts-1")
      .param("voice", "NOVA")
      .param("responseFormat", "MP3")
      .param("speed", "1.0"))
    .andExpect(status().isOk())
    .andReturn()
    .getResponse()
    .getContentAsByteArray();

    assertNotEquals(0, audioContent.length);
}

We called the endpoint with the NOVA voice for this request. As expected, we received audio bytes generated with the overridden voice.

6. Conclusion

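Text-to-speech APIs let us generate natural-sounding speech from text. With simple configuration and modern models, we can bring dynamic voice interaction into our applications.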

In this article, we explored how to integrate our application with the OpenAI TTS models using Spring AI. In the same way, we could integrate with other TTS models or build our own.
