Transcription API — Submit Call Recordings for Analysis

NeuronLens processes your call recordings asynchronously — you submit an audio file URL, choose which analysis features to run, and poll for results when they are ready. A single job can produce a speaker-diarized transcript, a plain-language summary, per-speaker sentiment scores, detected intent, and a QA evaluation against your scorecard — all from one API call.

Submit a Transcription Job

POST /transcription

Submits a call recording for processing. Returns a job_id immediately; use it to poll for results.

Request Body

audio_url

string

required

A publicly accessible URL pointing to the audio file to transcribe. You can use a pre-signed URL from cloud storage (S3, GCS, Azure Blob). The URL must remain accessible for at least 15 minutes after submission.

language

string

required

BCP-47 language code for the recording. This determines the speech recognition model. Examples: hi-IN, ta-IN, en-IN, te-IN, mr-IN.

speakers

integer

default: 2

Expected number of speakers in the recording for diarization. Accepted values: 1 to 10. For most call recordings, the default of 2 (agent + customer) is correct.

features

array

default: ["transcription"]

List of analysis features to run on the recording. Including more features increases processing time slightly. Available values:

transcription — speaker-diarized, timestamped speech-to-text (always included)
summary — a plain-language summary of the call (2-4 sentences)
sentiment — sentiment scores per speaker and overall (-1 to 1 scale)
intent — primary customer intent detected from the conversation
qa_scoring — evaluate the call against a QA scorecard (requires qa_scorecard_id)

qa_scorecard_id

string

The ID of the QA scorecard to evaluate against. Required when qa_scoring is included in features. Scorecards are created and managed in the NeuronLens dashboard under QA → Scorecards.

metadata

object

Optional key-value pairs attached to the job. These are passed through unchanged to all response and webhook payloads — useful for correlating jobs with records in your own system. For example: {"crm_ticket_id": "TKT-4492", "agent_id": "agt_001"}.
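
The constraints described above (speakers between 1 and 10, qa_scorecard_id required alongside qa_scoring, a fixed set of feature names) can be checked on the client before a request is sent. The following is a minimal Python sketch under that reading of the field descriptions; build_job_payload is a hypothetical helper name, not part of any official SDK, and the service may apply further validation of its own.

```python
from typing import Any, Dict, List, Optional

# Feature names listed in this reference.
KNOWN_FEATURES = {"transcription", "summary", "sentiment", "intent", "qa_scoring"}


def build_job_payload(
    audio_url: str,
    language: str,
    speakers: int = 2,
    features: Optional[List[str]] = None,
    qa_scorecard_id: Optional[str] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Validate inputs client-side and return a request body dict (hypothetical helper)."""
    features = list(features) if features else ["transcription"]
    unknown = set(features) - KNOWN_FEATURES
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    if not 1 <= speakers <= 10:
        raise ValueError("speakers must be between 1 and 10")
    if "qa_scoring" in features and not qa_scorecard_id:
        raise ValueError("qa_scorecard_id is required when qa_scoring is requested")

    payload: Dict[str, Any] = {
        "audio_url": audio_url,
        "language": language,
        "speakers": speakers,
        "features": features,
    }
    # Optional fields are only sent when the caller supplied them.
    if qa_scorecard_id:
        payload["qa_scorecard_id"] = qa_scorecard_id
    if metadata:
        payload["metadata"] = metadata
    return payload
```

Failing early on the client avoids spending a request on a job the API would presumably reject.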

Example Request

curl https://api.vinfer.ai/v1/transcription \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://storage.example.com/recordings/call_9Hm3kP7qZ.wav?signed=...",
    "language": "hi-IN",
    "speakers": 2,
    "features": ["transcription", "summary", "sentiment", "qa_scoring"],
    "qa_scorecard_id": "qsc_7Lb2pN5kR",
    "metadata": {
      "crm_ticket_id": "TKT-4492",
      "agent_id": "agt_001",
      "call_id": "cal_9Hm3kP7qZ"
    }
  }'
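
For readers who prefer Python over curl, the same request can be sketched with the standard library alone. The base URL and headers are copied from the curl example; make_submit_request and submit_job are hypothetical helper names, and error handling is left out for brevity.

```python
import json
import urllib.request

API_BASE = "https://api.vinfer.ai/v1"  # base URL taken from the curl example


def make_submit_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /transcription request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/transcription",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def submit_job(api_key: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON response."""
    req = make_submit_request(api_key, payload)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Separating request construction from sending makes the payload and headers easy to inspect in tests without touching the network.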

Example Response

{
  "job_id": "job_2Kn7wR4pM",
  "status": "pending",
  "created_at": "2024-02-01T10:00:00Z"
}

Processing time depends on audio duration and the features requested. A typical 3-minute call with all features enabled completes in under 60 seconds. Long recordings (30+ minutes) may take a few minutes. Poll GET /transcription/{job_id} or listen for a transcription.completed webhook event.
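
One way to poll without hammering the endpoint is to back off between attempts. The sketch below is an assumption about reasonable client behavior, not a documented requirement: the delays and timeout are illustrative, wait_for_job is a hypothetical name, and fetch_job stands for whatever function you use to call GET /transcription/{job_id} and return the parsed JSON.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(
    fetch_job: Callable[[str], Dict[str, Any]],
    job_id: str,
    initial_delay: float = 2.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the job is completed or failed, doubling the delay each time."""
    delay = initial_delay
    waited = 0.0
    while True:
        job = fetch_job(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} not finished after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # exponential backoff with a cap
```

The sleep parameter is injectable so the loop can be exercised in tests without real waiting. A returned job may have status failed, so callers should check the status before reading results.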

Get Job Status and Results

GET /transcription/{job_id}

Poll this endpoint to check the status of a submitted job and retrieve results once processing is complete.

Response Fields

job_id

string

The unique identifier for this transcription job.

status

string

Current job status: pending (queued), processing (actively being analyzed), completed (results available), or failed (processing error; check the error field for details).

language

string

The BCP-47 language code of the recording, as submitted with the job.

transcript

array

Array of transcript segments, each with speaker label and timing. Present only when status is completed.

transcript[].speaker

string

Speaker label: agent or customer, or speaker_1, speaker_2, etc., for recordings with more than two speakers.

transcript[].text

string

The transcribed text for this segment.

transcript[].start_time

number

Start time of the segment in seconds from the beginning of the recording.

transcript[].end_time

number

End time of the segment in seconds.

transcript[].confidence

number

Recognition confidence score for this segment, between 0 and 1. Scores above 0.90 are considered high confidence.
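
Since scores above 0.90 count as high confidence, one plausible use of this field is to flag the remaining segments for human review. The sketch below assumes segments shaped like the transcript entries in this reference; low_confidence_segments is a hypothetical helper, and treating a missing score as low confidence is a choice made here, not documented behavior.

```python
from typing import Any, Dict, List

HIGH_CONFIDENCE = 0.90  # threshold quoted in the field description


def low_confidence_segments(
    transcript: List[Dict[str, Any]], threshold: float = HIGH_CONFIDENCE
) -> List[Dict[str, Any]]:
    """Return segments whose confidence is not above the threshold."""
    # A segment without a confidence value is treated as needing review.
    return [seg for seg in transcript if seg.get("confidence", 0.0) <= threshold]
```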

summary

string

A 2-4 sentence plain-language summary of the call. Present when summary was included in features; null otherwise.

sentiment

object

Sentiment analysis results. Present when sentiment was included in features.

sentiment.overall

number

Overall call sentiment score from -1 (very negative) to 1 (very positive).

sentiment.by_speaker

object

Sentiment scores broken down by speaker: {"agent": 0.65, "customer": 0.38}.

intent

string

The primary customer intent detected from the conversation. Examples: loan_renewal_interest, complaint, payment_query, dnd_request. Present when intent was included in features; null otherwise.

qa_score

object

QA evaluation results. Present when qa_scoring was included in features.

qa_score.overall_score

number

Aggregate score across all scorecard parameters, expressed as a percentage (0-100).

qa_score.parameter_scores

array

Array of individual parameter evaluations. Each item has parameter_name (string), score (number), max_score (number), and passed (boolean).
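
The per-parameter items lend themselves to a quick client-side digest, for example listing the parameters that did not pass. The sketch below assumes the item shape described above; summarize_parameters is a hypothetical helper. The percentage it computes is a plain unweighted ratio of points earned to points possible, so it need not equal overall_score, whose weighting is not described in this reference.

```python
from typing import Any, Dict, List


def summarize_parameters(parameter_scores: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Total the points and collect the names of failed parameters."""
    earned = sum(p["score"] for p in parameter_scores)
    possible = sum(p["max_score"] for p in parameter_scores)
    failed = [p["parameter_name"] for p in parameter_scores if not p["passed"]]
    # Unweighted ratio; guarded against an empty scorecard.
    percent = 100.0 * earned / possible if possible else 0.0
    return {
        "earned": earned,
        "possible": possible,
        "unweighted_percent": percent,
        "failed": failed,
    }
```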

metadata

object

The metadata object you submitted with the job, returned unchanged.

created_at

string

ISO 8601 timestamp of when the job was submitted.

completed_at

string

ISO 8601 timestamp of when processing finished. null if still in progress.

Example Response (Completed Job)

{
  "job_id": "job_2Kn7wR4pM",
  "status": "completed",
  "language": "hi-IN",
  "created_at": "2024-02-01T10:00:00Z",
  "completed_at": "2024-02-01T10:00:52Z",
  "transcript": [
    {
      "speaker": "agent",
      "text": "Namaste, Priya ji. Main VInfer ki taraf se baat kar raha hoon. Kya aap abhi baat kar sakte hain?",
      "start_time": 0.4,
      "end_time": 5.1,
      "confidence": 0.97
    },
    {
      "speaker": "customer",
      "text": "Haan, bataiye.",
      "start_time": 5.9,
      "end_time": 7.2,
      "confidence": 0.95
    },
    {
      "speaker": "agent",
      "text": "Aapka loan renewal ka time aa gaya hai. Kya aap is baare mein baat karna chahenge?",
      "start_time": 7.6,
      "end_time": 13.0,
      "confidence": 0.96
    },
    {
      "speaker": "customer",
      "text": "Haan, mujhe kisi se baat karni hai. Please mujhe connect karein.",
      "start_time": 13.5,
      "end_time": 18.2,
      "confidence": 0.94
    }
  ],
  "summary": "The agent contacted Priya Sharma regarding loan renewal. The customer expressed interest and requested to be connected with a human agent to discuss options.",
  "sentiment": {
    "overall": 0.42,
    "by_speaker": {
      "agent": 0.65,
      "customer": 0.38
    }
  },
  "intent": "loan_renewal_interest",
  "qa_score": {
    "overall_score": 87.5,
    "parameter_scores": [
      {"parameter_name": "Greeting Compliance", "score": 10, "max_score": 10, "passed": true},
      {"parameter_name": "Product Pitch Accuracy", "score": 15, "max_score": 20, "passed": false},
      {"parameter_name": "DND Handling", "score": 10, "max_score": 10, "passed": true},
      {"parameter_name": "Escalation Protocol", "score": 10, "max_score": 10, "passed": true}
    ]
  },
  "metadata": {
    "crm_ticket_id": "TKT-4492",
    "agent_id": "agt_001",
    "call_id": "cal_9Hm3kP7qZ"
  }
}
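
As an illustration of working with a completed job, the segment timings can be used to derive talk time per speaker. This is a sketch over the transcript array as shown in the example; talk_time_by_speaker is a hypothetical helper and the metric itself is not something the API returns.

```python
from collections import defaultdict
from typing import Any, Dict, List


def talk_time_by_speaker(transcript: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum segment durations (end_time - start_time) per speaker label, in seconds."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in transcript:
        totals[seg["speaker"]] += seg["end_time"] - seg["start_time"]
    # Round to hide floating-point noise from the subtraction.
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}
```

Applied to the four segments in the example response, this attributes roughly 10.1 seconds to the agent and 6.0 seconds to the customer.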

Supported Audio Formats

NeuronLens accepts the following audio formats:

Format	Extension	Notes
WAV	`.wav`	Recommended for best accuracy. Uncompressed PCM preferred.
MP3	`.mp3`	Common for telephony recordings.
OGG	`.ogg`	Ogg Vorbis and Ogg Opus both supported.
FLAC	`.flac`	Lossless — good accuracy, larger file size.
M4A	`.m4a`	AAC audio in MPEG-4 container.

File size limit: 500 MB per submission. Duration limit: 4 hours per submission.

For best transcription accuracy, use recordings with a sample rate of 8 kHz or higher. Telephony recordings at 8 kHz (standard PSTN quality) work well. Stereo recordings with the agent and customer on separate channels produce the most accurate diarization. If you have separate-channel recordings, set "speakers": 2 explicitly.
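
The format, size, and duration limits above can be checked before a job is submitted. The sketch below is a client-side convenience under stated assumptions: preflight_problems is a hypothetical helper, the file extension is read from the URL path (so the query string of a pre-signed URL is ignored), and 500 MB is interpreted conservatively as 500,000,000 bytes because the reference does not say whether it means decimal or binary megabytes.

```python
import posixpath
from typing import List, Optional
from urllib.parse import urlparse

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a"}
MAX_BYTES = 500 * 1000 * 1000  # assumption: decimal MB, the stricter reading
MAX_SECONDS = 4 * 60 * 60      # 4 hours


def preflight_problems(
    audio_url: str,
    size_bytes: Optional[int] = None,
    duration_seconds: Optional[float] = None,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    ext = posixpath.splitext(urlparse(audio_url).path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes is not None and size_bytes > MAX_BYTES:
        problems.append("file exceeds the 500 MB limit")
    if duration_seconds is not None and duration_seconds > MAX_SECONDS:
        problems.append("recording exceeds the 4 hour limit")
    return problems
```

Size and duration are optional because they are often unknown when only a URL is at hand; the check simply skips what it cannot verify.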
