Transcription

Video transcription consists in transcribing the audio content of a video to a text.

This process might be called Automatic Speech Recognition or Speech to Text in more general context.

Provide a common API to many transcription backend, currently:

openai-whisper CLI
faster-whisper (via whisper-ctranslate2 CLI)

Potential candidates could be: whisper-cpp, vosk, ...

Requirements

Python 3
PIP

And at least one of the following transcription backend:

Python:
- openai-whisper
- whisper-ctranslate2>=0.4.3

Usage

Create a transcriber manually:

import { OpenaiTranscriber } from '@peertube/peertube-transcription'

(async () => {
  // Optional if you want to use a local installation of transcribe engines
  const binDirectory = 'local/pip/path/bin'

  // Create a transcriber powered by OpenAI Whisper CLI
  const transcriber = new OpenaiTranscriber({
    name: 'openai-whisper',
    command: 'whisper',
    languageDetection: true,
    binDirectory
  });

  // If not installed globally, install the transcriber engine (use pip under the hood)
  await transcriber.install('local/pip/path')

  // Transcribe
  const transcriptFile = await transcriber.transcribe({
    mediaFilePath: './myVideo.mp4',
    model: 'tiny',
    format: 'txt'
  });

  console.log(transcriptFile.path);
  console.log(await transcriptFile.read());
})();

Using a local model file:

import { WhisperBuiltinModel } from '@peertube/peertube-transcription/dist'

const transcriptFile = await transcriber.transcribe({
  mediaFilePath: './myVideo.mp4',
  model: await WhisperBuiltinModel.fromPath('./models/large.pt'),
  format: 'txt'
});

You may use the builtin Factory if you're happy with the default configuration:

import { transcriberFactory } from '@peertube/peertube-transcription'

transcriberFactory.createFromEngineName({
  engineName: transcriberName,
  logger: compatibleWinstonLogger,
  transcriptDirectory: '/tmp/transcription'
})

For further usage ../tests/src/transcription/whisper/transcriber/openai-transcriber.spec.ts

Lexicon

ONNX: Open Neural Network eXchange. A specification, the ONNX Runtime run these models.
GPTs: Generative Pre-Trained Transformers
LLM: Large Language Models
NLP: Natural Language Processing
MLP: Multilayer Perceptron
ASR: Automatic Speech Recognition
WER: Word Error Rate
CER: Character Error Rate