Extracting ARIB close captions from m2ts streams

whitehead
Posts: 10
Joined: Nov 2nd, '18, 02:03
Has thanked: 1 time

Extracting ARIB close captions from m2ts streams

Post by whitehead » Jun 13th, '20, 17:00

Does anyone have a workflow or tool they can recommend for decoding ARIB closed captions from raw m2ts streams? Ideally, I'd like .srt as the output.

A Windows, macOS or Linux workflow would be fine; I only care about the result.

Problem:

ffmpeg -i <filename> sees the stream:

[....]
Stream #0:0[0x100]: Video: mpeg2video (Main) ([2][0][0][0] / 0x0002), yuv420p(tv, top first), 1440x1080 [SAR 4:3 DAR 16:9], 29.97 fps, 29.97 tbr, 90k tbn, 59.94 tbc
Stream #0:1[0x110]: Audio: aac (LC) ([15][0][0][0] / 0x000F), 48000 Hz, stereo, fltp, 255 kb/s
Stream #0:2[0x130]: Subtitle: arib_caption (Profile A) ([6][0][0][0] / 0x0006)
Stream #0:3[0x138]: Data: bin_data ([6][0][0][0] / 0x0006)
Stream #0:4[0x140]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:5[0x160]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:6[0x161]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:7[0x162]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:8[0x163]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:9[0x164]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:10[0x165]: Unknown: none ([13][0][0][0] / 0x000D)
Stream #0:11[0x111]: Audio: aac ([15][0][0][0] / 0x000F), 0 channels
[...]


and I can attempt to extract it using something like:
ffmpeg -y -i source.m2t -map 0: -vn -an -scodec copy -dn -ignore_unknown -f mpegts output.ts

However, having poked around in the result with a hex editor and tried to open it as Shift-JIS and UTF-8, it is not obvious to me how to convert it into something readable.
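For anyone who wants to look at the caption bytes directly rather than through ffmpeg, here is a minimal Python sketch that pulls the payload of one PID (0x130, the arib_caption stream in the listing above) out of the transport stream. It relies only on the standard MPEG-TS packet layout: 188-byte packets starting with sync byte 0x47, a 13-bit PID in bytes 1-2, and an optional adaptation field. The function name `iter_pid_payloads` and the `packet_size` parameter are made up for illustration; pass 192 if the file uses m2ts-style packets with a 4-byte prefix. As far as I understand, the payload is PES-wrapped caption data in ARIB's own 8-bit coding with control codes, which would explain why it does not open cleanly as Shift-JIS or UTF-8; this sketch does not attempt to decode that.

```python
TS_PACKET = 188  # size of a plain MPEG-TS packet
SYNC_BYTE = 0x47


def iter_pid_payloads(data: bytes, pid: int, packet_size: int = 188):
    """Yield (payload_unit_start, payload_bytes) for each packet of one PID.

    packet_size is 188 for plain TS, 192 for m2ts with a 4-byte prefix.
    """
    prefix = packet_size - TS_PACKET
    for off in range(0, len(data) - packet_size + 1, packet_size):
        pkt = data[off + prefix: off + packet_size]
        if pkt[0] != SYNC_BYTE:
            continue  # lost sync; a real tool would resynchronise here
        if ((pkt[1] & 0x1F) << 8 | pkt[2]) != pid:
            continue
        unit_start = bool(pkt[1] & 0x40)
        adaptation = (pkt[3] >> 4) & 0x3
        if adaptation in (0, 2):
            continue  # reserved value, or adaptation field with no payload
        start = 4
        if adaptation == 3:
            start += 1 + pkt[4]  # skip length byte plus adaptation field
        yield unit_start, pkt[start:]
```

Something like `raw = b"".join(p for _, p in iter_pid_payloads(open("source.m2t", "rb").read(), 0x130))` would then give the concatenated PES bytes of the caption stream to inspect in a hex editor.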

ccextractor is not much use either (or, more likely, I don't know how to use it):

user@host:~$ ccextractor source.m2t
CCExtractor 0.88, Carlos Fernandez Sanz, Volker Quetschke.
Teletext portions taken from Petr Kutalek's telxcc
--------------------------------------------------------------------------
Input: source.m2t
[Extract: 1] [Stream mode: Autodetect]
[Program : Auto ] [Hauppage mode: No] [Use MythTV code: Auto]
[Timing mode: Auto] [Debug: No] [Buffer input: No]
[Use pic_order_cnt_lsb for H.264: No] [Print CC decoder traces: No]
[Target format: .srt] [Encoding: UTF-8] [Delay: 0] [Trim lines: No]
[Add font color data: Yes] [Add font typesetting: Yes]
[Convert case: No] [Video-edit join: No]
[Extraction start time: not set (from start)]
[Extraction end time: not set (to end)]
[Live stream: No] [Clock frequency: 90000]
[Teletext page: Autodetect]
[Start credits text: None]
[Quantisation-mode: CCExtractor's internal function]

-----------------------------------------------------------------
Opening file: source.m2t
Detected MP4 box with name: meta
File seems to be a transport stream, enabling TS mode
Analyzing data in general mode

This TS file has more than one program. These are the program numbers found:
191
791
792
*****ISDB subtitles detected
*****ISDB subtitles detected
*****ISDB subtitles detected
*****ISDB subtitles detected
1% | 00:10*****ISDB subtitles detected
*****ISDB subtitles detected
84% | 25:35*****ISDB subtitles detected
*****ISDB subtitles detected
99% | 30:10*****ISDB subtitles detected
*****ISDB subtitles detected
100% | 30:25
Number of NAL_type_7: 0
Number of VCL_HRD: 0
Number of NAL HRD: 0
Number of jump-in-frames: 0
Number of num_unexpected_sei_length: 0

Min PTS: 15:46:36:674
Max PTS: 16:17:01:680
Length: 00:30:25:006
Done, processing time = 78 seconds

No captions were found in input.
Issues? Open a ticket here
https://github.com/CCExtractor/ccextractor/issues
user@host:~$

Searching for "*****ISDB subtitles detected" I came across
https://github.com/stz2012/libarib25/
but I am not skilled enough to understand what this library does, or to write a tool that uses it. The ARIB closed captioning spec seems to be available in Japanese only, and to require membership in the Association of Radio Industries and Businesses (ARIB).

whitehead
Posts: 10
Joined: Nov 2nd, '18, 02:03
Has thanked: 1 time

Re: Extracting ARIB close captions from m2ts streams

Post by whitehead » Jul 8th, '20, 22:08

Answering myself, in case someone else finds it useful.

The tool to extract ARIB closed captions from m2ts streams is called Caption2Ass.
The original (not updated since 2015) is here:
https://github.com/iGlitch/Caption2Ass/
A more up-to-date fork (with a bunch of bugfixes) is here:
https://github.com/maki-rxrz/Caption2Ass_PCR/
The tool writes the embedded ARIB closed captions out as either .ass or .srt.

metalosaurio
Posts: 8
Joined: Oct 20th, '18, 08:26
Been thanked: 1 time

Re: Extracting ARIB close captions from m2ts streams

Post by metalosaurio » Jul 9th, '20, 19:33

There's also the software Amatsukaze, in case you're encoding as well (though it is in Japanese).
It handles the OCR better than Caption2Ass, or at least handles more characters without you needing to check the result, as you do with Caption2Ass.

whitehead
Posts: 10
Joined: Nov 2nd, '18, 02:03
Has thanked: 1 time

Re: Extracting ARIB close captions from m2ts streams

Post by whitehead » Jul 12th, '20, 06:33

I think that I am missing something. Could you elaborate?

Currently, if I don't have the correct fonts installed, a player that decodes ARIB closed captions (VLC 3.x on PC, IINA on a Mac or similar) will display boxes instead of kanji during playback. To me this indicates that the closed captions are included as text, not as images, from which it follows that no OCR needs to take place, and that the ass/srt file Caption2Ass generates has the same content as the ARIB captions embedded in the m2ts stream.

You mention encoding. I don't do encoding or hard subbing; I just embed an srt file into an mkv container. But if I did want hard subs and had an SRT file, wouldn't the contents of that SRT simply end up in the resulting encode?

At which point does the OCR take place?

Thank you in advance for your explanation.

metalosaurio
Posts: 8
Joined: Oct 20th, '18, 08:26
Been thanked: 1 time

Re: Extracting ARIB close captions from m2ts streams

Post by metalosaurio » Jul 13th, '20, 00:36

Oh, maybe "mapping" is a better word than OCR in this case.
There are the ARIB 外字 (gaiji, external characters), not all of which are covered by the Unicode characters Caption2Ass uses. Whenever the captions contain one of those, you get 外字 plus a huge number in the ass or srt file, so you have to check the ass/srt file and delete them afterwards.
Amatsukaze asks you what to replace each ARIB 外字 with when it first appears, and then uses the same character for subsequent files. Sadly, Amatsukaze doesn't have an option to just extract ass/srt files; extraction is part of the whole encoding process.
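The clean-up step for Caption2Ass output can be scripted along the same lines as what Amatsukaze does: keep a table of replacements and reuse it for every file. Below is a rough Python sketch. The placeholder format is an assumption: the pattern `[外字<hex digits>]` is a guess for illustration only, so check what your ass/srt files actually contain and adjust the regular expression. The names `GAIJI_PATTERN` and `replace_gaiji` are made up here.

```python
import re

# ASSUMPTION: placeholders look like "[外字1A2B]". The real format written by
# Caption2Ass may differ; adjust this pattern to match your files.
GAIJI_PATTERN = re.compile(r"\[外字([0-9A-Fa-f]+)\]")


def replace_gaiji(text, mapping, default=""):
    """Replace gaiji placeholders using a reusable mapping.

    mapping maps the upper-cased placeholder number to a replacement string.
    Placeholders without a mapping are replaced by `default` (deleted when
    it is empty) and returned in a set so they can be added to the table.
    """
    unknown = set()

    def substitute(match):
        key = match.group(1).upper()
        if key in mapping:
            return mapping[key]
        unknown.add(key)
        return default

    return GAIJI_PATTERN.sub(substitute, text), unknown
```

Storing `mapping` in a small JSON file and loading it for each new subtitle file would give the same "answer once, reuse afterwards" behaviour, and the returned set shows which characters still need a decision.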
