MP4 Format Explained

Published: 2025-12-09 16:03:18

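MP4 overview

All data in an MP4 file is stored in boxes (called atoms in QuickTime). In other words, an MP4 file is a sequence of boxes, each with a type and a length, and a box can be thought of as a self-describing block of data. A box may contain other boxes, in which case it is called a container box. An MP4 file has exactly one "ftyp" box, placed at the start, which marks the file as MP4 and carries some information about it. It then has exactly one "moov" box (Movie Box), a container box whose child boxes hold the metadata for the media. The media data itself lives in "mdat" boxes (Media Data Box). Unlike moov, mdat is not a container box: its payload is raw media bytes rather than child boxes. A file may have several mdat boxes, or none at all when all the media data is referenced from other files. The layout of the media data is described by the metadata.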

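Some key terms:

track: a collection of samples. For media data, a track is a video or audio sequence.
hint track: a special track that contains no media data. Instead it holds instructions for packaging other data tracks into a streaming format.
sample: for a non-hint track, a video sample is one video frame or a group of consecutive frames, and an audio sample is a contiguous section of compressed audio. For a hint track, a sample defines the format of one or more streaming packets.
sample table: the table that gives the timing and physical layout of the samples.
chunk: a unit made up of several samples from one track.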

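Basic concepts

File: made up of many Boxes and FullBoxes.

Box: every Box consists of a Header and Data.

FullBox: an extension of Box that adds an 8-bit version and 24-bit flags to the Header.

Header: holds the size and type of the whole Box, each 32 bits wide. When size == 0, the Box extends to the end of the file, so it is the last Box in the file. When size == 1, the length needs more bits, and a 64-bit largesize field follows to give the Box length. When type is uuid, the Box data is a user-defined extended type.

Data: the actual payload of the Box, which may be raw data or further child Boxes.

When the Data of a Box is a series of child Boxes, the Box is also called a Container Box.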
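The header rules above can be sketched in a few lines of Python. This is a minimal sketch, not a complete parser: iter_boxes and make_box are names invented here, the input is a synthetic byte string rather than a real file, and the 16-byte extended type that follows a uuid header is an assumption taken from the ISO base media file format rather than from the text above.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(buf: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload_start, box_end) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, raw_type = struct.unpack_from(">I4s", buf, pos)
        header_len = 8
        if size == 1:
            # size == 1: the real length is in the 64-bit largesize field.
            (size,) = struct.unpack_from(">Q", buf, pos + 8)
            header_len = 16
        elif size == 0:
            # size == 0: the box runs to the end of the enclosing data.
            size = end - pos
        if raw_type == b"uuid":
            header_len += 16  # assumed: a 16-byte extended type follows
        if size < header_len or pos + size > end:
            raise ValueError(f"corrupt box at offset {pos}")
        yield raw_type.decode("latin-1"), pos + header_len, pos + size
        pos += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build a box with a plain 32-bit size."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic file: ftyp, a moov containing one dummy mvhd, and mdat.
demo = (make_box(b"ftyp", b"mp42" + b"\x00" * 4 + b"isom")
        + make_box(b"moov", make_box(b"mvhd", b"\x00" * 100))
        + make_box(b"mdat", b"\x01\x02\x03"))

for box_type, payload_start, box_end in iter_boxes(demo):
    print(box_type, payload_start, box_end)
    if box_type == "moov":  # container box: walk the payload the same way
        for child in iter_boxes(demo, payload_start, box_end):
            print("  ", *child)
```

A container box is handled by calling the same function again on the payload range, which is why the generator takes start and end offsets instead of slicing the buffer.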

The structure is shown in the figure below:

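MP4 file format at a glance

An MP4 file consists of multiple boxes, each storing different information, and the boxes form a tree, as shown in the figure below.

There are many box types. These are three of the most important top-level boxes:

ftyp: File Type Box, describes the MP4 specification and version the file conforms to;
moov: Movie Box, the metadata for the media; there is exactly one;
mdat: Media Data Box, holds the actual media data; there may be more than one;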

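isom (ISO Base Media File Format) is a base file format defined in MPEG-4 Part 12. Common container formats such as MP4 and 3gp are derived from it, and it is closely related to QuickTime's QT format, on which it was originally based.

The specifications an MP4 file may follow include mp41 and mp42, which are themselves derived from isom.

3gp (3GPP): a container format used mainly on 3G phones;
QT: short for QuickTime; a .qt file is an Apple QuickTime media file;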

ftyp definition
ftyp is defined as follows:

aligned(8) class FileTypeBox extends Box('ftyp') {
    unsigned int(32) major_brand;
    unsigned int(32) minor_version;
    unsigned int(32) compatible_brands[]; // to end of the box
}

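Below is the specification's description of a brand. A brand is simply the code for a specific container format, written as four bytes, for example mp41.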

A brand is a four-letter code representing a format or subformat. Each
file has a major brand (or primary brand), and also a compatibility
list of brands.

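The ftyp fields mean the following:

major_brand: common values include isom, mp41, mp42, avc1, and qt. It says which format is the "best" basis for parsing the file. For example, if major_brand is A and compatible_brands contains A1, a decoder that supports both A and A1 should preferably decode the file as A, while a decoder that supports A1 but not A can decode it as A1;
minor_version: informational detail about major_brand, such as a version number. It must not be used to decide whether the file conforms to a standard or specification;
compatible_brands: the list of brands the file is compatible with. For example, mp41 lists isom as a compatible brand. A reader that implements any brand in the list can decode part (or all) of the file;

According to the specification, isom should not be used as major_brand; a specific brand such as mp41 should be used instead. For that reason, no file extension or MIME type is defined for isom.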
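The field layout and the brand-selection rule can be illustrated with a short Python sketch. The function names parse_ftyp and pick_brand are made up for this example, the payload is hand-built, and the fallback order among compatible brands (first match wins) is an assumption rather than something the specification prescribes.

```python
import struct
from typing import Iterable, Optional


def parse_ftyp(payload: bytes) -> dict:
    """Parse the payload of an ftyp box (the bytes after the 8-byte header)."""
    major, minor = struct.unpack_from(">4sI", payload, 0)
    # compatible_brands fills the rest of the box, four bytes per brand.
    compatible = [payload[i:i + 4].decode("latin-1")
                  for i in range(8, len(payload) - 3, 4)]
    return {"major_brand": major.decode("latin-1"),
            "minor_version": minor,
            "compatible_brands": compatible}


def pick_brand(ftyp: dict, supported: Iterable[str]) -> Optional[str]:
    """Prefer major_brand; otherwise fall back to a supported compatible brand."""
    supported = set(supported)
    if ftyp["major_brand"] in supported:
        return ftyp["major_brand"]
    for brand in ftyp["compatible_brands"]:
        if brand in supported:
            return brand
    return None


# Hand-built payload: major_brand mp42, minor_version 0, two compatible brands.
example = parse_ftyp(b"mp42" + struct.pack(">I", 0) + b"isom" + b"mp42")
print(example)
print(pick_brand(example, {"isom"}))          # reader that only knows isom
print(pick_brand(example, {"isom", "mp42"}))  # reader that knows both
```

Note that minor_version is returned but never consulted when choosing a brand, in line with the rule that it is informational only.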

Below are several common brands with their file extensions and MIME types; more brands can be found here.

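In discussions of the MP4 specifications, "AVC" sometimes means the AVC file format and sometimes means the AVC compression standard (H.264). A brief distinction:

AVC file format: derived from the ISO base media file format and carrying video coded with the AVC compression standard. It can be regarded as an extension of MP4, its brand is usually avc1, and it is defined in MPEG-4 Part 15.
AVC compression standard (H.264): defined in MPEG-4 Part 10.
ISO Base Media File Format: defined in MPEG-4 Part 12.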

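moov (Movie Box)

The Movie Box stores the metadata of the MP4 file. It is often placed near the start of the file, right after ftyp, although it may also come after mdat at the end.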

aligned(8) class MovieBox extends Box('moov') { }

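The two most important boxes inside moov are mvhd and trak:

mvhd: Movie Header Box, overall information about the MP4 file, such as creation time and duration;
trak: Track Box. An MP4 file can contain one or more tracks (for example a video track and an audio track), and the information for each track lives in its trak. trak is a container box holding at least two boxes, tkhd and mdia;

mvhd applies to the whole movie, tkhd to a single track, mdhd to the media, vmhd to video, and smhd to audio. They run from general to specific, and the former are usually derived from the latter.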

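mvhd (Movie Header Box)
Overall information about the MP4 file that is independent of any particular video or audio stream, such as creation time and duration.

It is defined as follows: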

aligned(8) class MovieHeaderBox extends FullBox('mvhd', version, 0) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) timescale;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) timescale;
        unsigned int(32) duration;
    }
    template int(32) rate = 0x00010000; // typically 1.0
    template int(16) volume = 0x0100; // typically, full volume
    const bit(16) reserved = 0;
    const unsigned int(32)[2] reserved = 0;
    template int(32)[9] matrix =
        { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // Unity matrix
    bit(32)[6] pre_defined = 0;
    unsigned int(32) next_track_ID;
}

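The fields mean the following:

creation_time: file creation time, in seconds since midnight, 1904-01-01 UTC;
modification_time: file modification time, in the same units;
timescale: the number of time units in one second (an integer). For example, if timescale is 1000, one second contains 1000 time units. The movie duration and the duration in each tkhd are converted with this value: a track duration of 10,000 means 10,000 / 1000 = 10 s;
duration: movie duration (an integer), derived from the tracks in the file; it equals the duration of the longest track;
rate: preferred playback rate, a 32-bit fixed-point number whose high and low 16 bits are the integer and fractional parts ([16.16]). For example, 0x00010000 is 1.0, normal playback speed;
volume: playback volume, a 16-bit fixed-point number whose high and low 8 bits are the integer and fractional parts ([8.8]). For example, 0x0100 is 1.0, full volume;
matrix: transformation matrix for the video; it can usually be ignored;
next_track_ID: a non-zero 32-bit integer that can usually be ignored. It gives the track ID to use for the next track added to the movie and must be larger than every track ID already in use. Only when it is all 1s (0xFFFFFFFF) do the existing tracks have to be searched for an unused track ID when a track is added;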
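To make the timescale and fixed-point arithmetic concrete, here is a minimal Python sketch that reads an mvhd payload, meaning the bytes after the 8-byte box header and starting with the FullBox version and flags. parse_mvhd is a name invented here and the payload is hand-built with illustrative values, not taken from a real file.

```python
import struct


def parse_mvhd(payload: bytes) -> dict:
    """Parse an mvhd payload: version(1) + flags(3) + the fields listed above."""
    version = payload[0]
    if version == 1:
        creation, modification, timescale, duration = struct.unpack_from(
            ">QQIQ", payload, 4)
        pos = 4 + 28
    else:  # version == 0
        creation, modification, timescale, duration = struct.unpack_from(
            ">IIII", payload, 4)
        pos = 4 + 16
    rate, volume = struct.unpack_from(">ih", payload, pos)
    # Skip rate(4) + volume(2) + reserved(2 + 8) + matrix(36) + pre_defined(24).
    (next_track_id,) = struct.unpack_from(">I", payload, pos + 76)
    return {
        "creation_time": creation,
        "modification_time": modification,
        "timescale": timescale,
        "duration": duration,
        "duration_seconds": duration / timescale if timescale else 0.0,
        "rate": rate / 65536,    # [16.16] fixed point
        "volume": volume / 256,  # [8.8] fixed point
        "next_track_ID": next_track_id,
    }


# Hand-built version 0 payload: timescale 1000, duration 10000, two tracks in use.
unity = struct.pack(">9i", 0x10000, 0, 0, 0, 0x10000, 0, 0, 0, 0x40000000)
mvhd_payload = (bytes(4)                                   # version 0, flags 0
                + struct.pack(">IIII", 0, 0, 1000, 10000)  # times, timescale, duration
                + struct.pack(">ih", 0x00010000, 0x0100)   # rate 1.0, volume 1.0
                + bytes(10) + unity + bytes(24)
                + struct.pack(">I", 3))                    # next_track_ID
info = parse_mvhd(mvhd_payload)
print(info["duration_seconds"], info["rate"], info["volume"], info["next_track_ID"])
```

The same division by 65536 or 256 applies wherever the article mentions [16.16] or [8.8] values.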

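tkhd (Track Header Box)
Metadata for a single track. It contains the following fields:

version: version of the tkhd box;
flags: a bitwise OR of the values below. The default is 7 (0x000001 | 0x000002 | 0x000004), meaning the track is enabled, used in playback, and used in preview.
Track_enabled: 0x000001, the track is enabled; when this bit is 0, the track is not enabled;
Track_in_movie: 0x000002, the track is used during playback;
Track_in_preview: 0x000004, the track is used in preview mode;
creation_time: creation time of the track;
modification_time: time the track was last modified;
track_ID: unique identifier of the track; it must not be 0 and must not repeat;
duration: full duration of the track (divide by the mvhd timescale to get seconds);
layer: stacking order of video tracks. Lower numbers are closer to the viewer, so 1 is above 2 and 0 is above 1;
alternate_group: group ID of the track. Tracks with the same alternate_group value belong to the same group, and only one track in a group can play at a time. A value of 0 means the track is not grouped with any other track. A group may also contain a single track;
volume: volume of an audio track, an [8.8] fixed-point value between 0.0 and 1.0;
matrix: transformation matrix for the video;
width, height: width and height of the video, stored as [16.16] fixed-point values;

It is defined as follows: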

aligned(8) class TrackHeaderBox extends FullBox('tkhd', version, flags) {
    if (version==1) {
        unsigned int(64) creation_time;
        unsigned int(64) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(64) duration;
    } else { // version==0
        unsigned int(32) creation_time;
        unsigned int(32) modification_time;
        unsigned int(32) track_ID;
        const unsigned int(32) reserved = 0;
        unsigned int(32) duration;
    }
    const unsigned int(32)[2] reserved = 0;
    template int(16) layer = 0;
    template int(16) alternate_group = 0;
    template int(16) volume = {if track_is_audio 0x0100 else 0};
    const unsigned int(16) reserved = 0;
    template int(32)[9] matrix =
        { 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 }; // unity matrix
    unsigned int(32) width;
    unsigned int(32) height;
}
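As a rough illustration of how the flags and the fixed-point width and height are read, the following Python sketch parses a tkhd payload (the bytes after the 8-byte box header). parse_tkhd and the constant names are made up for this sketch, and the payload describes an imaginary 1920x1080 video track.

```python
import struct

TRACK_ENABLED = 0x000001
TRACK_IN_MOVIE = 0x000002
TRACK_IN_PREVIEW = 0x000004


def parse_tkhd(payload: bytes) -> dict:
    """Parse a tkhd payload: version(1) + flags(3) + the fields listed above."""
    version = payload[0]
    flags = int.from_bytes(payload[1:4], "big")
    if version == 1:
        creation, modification, track_id, _reserved, duration = struct.unpack_from(
            ">QQIIQ", payload, 4)
        pos = 4 + 32
    else:  # version == 0
        creation, modification, track_id, _reserved, duration = struct.unpack_from(
            ">IIIII", payload, 4)
        pos = 4 + 20
    pos += 8  # const unsigned int(32)[2] reserved
    layer, alternate_group, volume = struct.unpack_from(">hhh", payload, pos)
    pos += 6 + 2 + 36  # the three fields above, reserved(2), matrix(36)
    width, height = struct.unpack_from(">II", payload, pos)
    return {
        "track_ID": track_id,
        "duration": duration,
        "enabled": bool(flags & TRACK_ENABLED),
        "in_movie": bool(flags & TRACK_IN_MOVIE),
        "in_preview": bool(flags & TRACK_IN_PREVIEW),
        "layer": layer,
        "alternate_group": alternate_group,
        "volume": volume / 256,   # [8.8] fixed point
        "width": width / 65536,   # [16.16] fixed point
        "height": height / 65536,
    }


# Hand-built version 0 payload for an imaginary video track with flags = 7.
unity = struct.pack(">9i", 0x10000, 0, 0, 0, 0x10000, 0, 0, 0, 0x40000000)
tkhd_payload = (bytes([0, 0, 0, 7])
                + struct.pack(">IIIII", 0, 0, 1, 0, 5000)
                + bytes(8)
                + struct.pack(">hhh", 0, 0, 0)  # video track: volume 0
                + bytes(2) + unity
                + struct.pack(">II", 1920 << 16, 1080 << 16))
track = parse_tkhd(tkhd_payload)
print(track)
```

The duration returned here is still in movie timescale units; converting it to seconds needs the timescale from mvhd.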

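An example:

mehd (Movie Extends Header Box) is optional and declares the full duration of the movie (fragment_duration). If it is absent, every fragment has to be walked to obtain the full duration. In fMP4 scenarios, fragment_duration usually cannot be known in advance.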

aligned(8) class MovieExtendsHeaderBox extends FullBox('mehd', version, 0) {
    if (version==1) {
        unsigned int(64) fragment_duration;
    } else { // version==0
        unsigned int(32) fragment_duration;
    }
}

trex (Track Extends Box)

Sets default values for the samples of an fMP4, such as duration and size.

aligned(8) class TrackExtendsBox extends FullBox('trex', 0, 0) {
    unsigned int(32) track_ID;
    unsigned int(32) default_sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
}

The fields mean the following:

track_ID: ID of the corresponding track, for example the video track or the audio track;
default_sample_description_index: default sample description index (refers to stsd);
default_sample_duration: default sample duration, usually 0;
default_sample_size: default sample size, usually 0;
default_sample_flags: default sample flags, usually 0;

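default_sample_flags occupies 4 bytes and is more involved. Its structure is as follows:

In older versions of the specification the first 6 bits were all reserved; in newer versions only the first 4 bits are. The meaning of is_leading is not very intuitive and is covered in the next subsection.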

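reserved: 4 bits, reserved;
is_leading: 2 bits, whether the sample is a leading sample. Possible values:
- 0: it is unknown whether the sample is a leading sample (the value normally used);
- 1: the sample is a leading sample that depends on a sample before the referenced I-frame and therefore cannot be decoded;
- 2: the sample is not a leading sample;
- 3: the sample is a leading sample that has no dependency before the referenced I-frame and can therefore be decoded;
sample_depends_on: 2 bits, whether the sample depends on other samples. Possible values:
- 0: it is unknown whether the sample depends on others;
- 1: the sample depends on other samples (not an I-frame);
- 2: the sample does not depend on other samples (an I-frame);
- 3: reserved;
sample_is_depended_on: 2 bits, whether other samples depend on this one. Possible values:
- 0: it is unknown whether other samples depend on this one;
- 1: other samples may depend on this one;
- 2: no other sample depends on this one;
- 3: reserved;
sample_has_redundancy: 2 bits, whether redundant coding is present. Possible values:
- 0: it is unknown whether redundant coding is present;
- 1: redundant coding is present;
- 2: no redundant coding is present;
- 3: reserved;
sample_padding_value: 3 bits, padding value;
sample_is_non_sync_sample: 1 bit, set when the sample is not a key frame;
sample_degradation_priority: 16 bits, degradation priority (typically for handling problems that arise during streaming);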
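The bit layout listed above (4 + 2 + 2 + 2 + 2 + 3 + 1 + 16 bits, most significant bit first) can be unpacked with shifts and masks. The Python sketch below assumes the newer layout with 4 reserved bits; decode_sample_flags is a name chosen for this example, and the two input values are illustrative flag words rather than values quoted from a particular file.

```python
def decode_sample_flags(flags: int) -> dict:
    """Split a 32-bit sample flags word into its fields (4 reserved bits on top)."""
    return {
        "is_leading": (flags >> 26) & 0x3,
        "sample_depends_on": (flags >> 24) & 0x3,
        "sample_is_depended_on": (flags >> 22) & 0x3,
        "sample_has_redundancy": (flags >> 20) & 0x3,
        "sample_padding_value": (flags >> 17) & 0x7,
        "sample_is_non_sync_sample": (flags >> 16) & 0x1,
        "sample_degradation_priority": flags & 0xFFFF,
    }


# sample_depends_on = 2 and the non-sync bit clear: an I-frame / key frame.
key_frame = decode_sample_flags(0x02000000)
# sample_depends_on = 1 and the non-sync bit set: a frame that needs others.
dependent_frame = decode_sample_flags(0x01010000)
print(key_frame)
print(dependent_frame)
```

The same decoding applies to default_sample_flags in trex and tfhd and to first_sample_flags and sample_flags in trun, since they share this layout.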

An example:

About is_leading

is_leading is not easy to explain, so the original text of the specification is quoted here for reference.

A leading sample (usually a picture in video) is defined relative to a reference sample, which is the immediately prior sample that is marked as “sample_depends_on” having no dependency (an I picture). A leading sample has both a composition time before the reference sample, and possibly also a decoding dependency on a sample before the reference sample. Therefore if, for example, playback and decoding were to start at the reference sample, those samples marked as leading would not be needed and might not be decodable. A leading sample itself must therefore not be marked as having no dependency.

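To keep the explanation simple, "leading frame" below stands for leading sample and "referenced frame" for referenced sample.

Take H.264 as an example. H.264 has I-frames, P-frames, and B-frames. Because of B-frames, the decoding order and the rendering order of video frames can differ.

One feature of MP4 files is support for playback from a random position. On a video site, for instance, you can drag the progress bar to seek.

The moment the progress bar lands on is often not an I-frame. To play smoothly, the player has to look backwards for the nearest I-frame and, if possible, start decoding and playing from it (in other words, playback cannot always start from the nearest preceding I-frame).

The I-frame at which decoding starts is the referenced frame. A frame that comes after the referenced frame in decoding order but is presented before it is a leading frame.

Recall the cases where is_leading is 1 or 3. Both are leading frames, so when is a leading frame decodable and when is it not?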

1: this sample is a leading sample that has a dependency before the referenced I‐picture (and is therefore not decodable);
3: this sample is a leading sample that has no dependency before the referenced I‐picture (and is therefore decodable);

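1. Example with is_leading = 1: decoding frame 2 (the leading frame) depends on frame 1 and on frame 3 (the referenced frame). Searching backwards from frame 2 in the video stream, the nearest I-frame is frame 3. Even once frame 3 has been decoded, frame 2 cannot be decoded, because frame 1 comes before the referenced frame and has not been decoded.

2. Example with is_leading = 3: here frame 2 (the leading frame) has no dependency before the referenced frame, so it can be decoded.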
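The two examples can be reproduced with a small model in Python. This is a simplified sketch under stated assumptions: Frame and classify_is_leading are invented for illustration, frames are listed in decoding order with an explicit set of dependencies, any frame without dependencies is treated as an I-frame, and the value 0 (unknown) is never produced. A real muxer would take this information from the encoder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Frame:
    name: str
    pts: int                                     # presentation order
    deps: Set[str] = field(default_factory=set)  # frames needed to decode it


def classify_is_leading(decode_order: List[Frame]) -> Dict[str, int]:
    """Return an is_leading value (1, 2 or 3) for each frame."""
    result: Dict[str, int] = {}
    ref_idx = None
    for i, frame in enumerate(decode_order):
        if not frame.deps:
            ref_idx = i             # an I-frame becomes the referenced frame
            result[frame.name] = 2  # and is never a leading frame itself
            continue
        if ref_idx is None or frame.pts > decode_order[ref_idx].pts:
            result[frame.name] = 2  # presented after the referenced frame
            continue
        # Leading frame: check whether it can be decoded when decoding
        # starts at the referenced frame.
        available: Set[str] = set()
        for earlier in decode_order[ref_idx:i]:
            if earlier.deps <= available:
                available.add(earlier.name)
        result[frame.name] = 3 if frame.deps <= available else 1
    return result


# Example 1: frame 2 needs frame 1 (before the I-frame) and frame 3 (the I-frame).
case1 = classify_is_leading([
    Frame("f0", 0), Frame("f1", 1, {"f0"}),
    Frame("f3", 3), Frame("f2", 2, {"f1", "f3"}),
])
# Example 2: frame 2 only needs frame 3.
case2 = classify_is_leading([
    Frame("f0", 0), Frame("f1", 1, {"f0"}),
    Frame("f3", 3), Frame("f2", 2, {"f3"}),
])
print(case1["f2"], case2["f2"])
```

In both cases frame 2 is decoded after frame 3 but presented before it, which is what makes it a leading frame; only its dependencies decide between 1 and 3.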

moof (Movie Fragment Box)

moof is a container box. Its metadata lives in nested boxes such as mfhd, tfhd, and trun.

The pseudocode is as follows:

aligned(8) class MovieFragmentBox extends Box('moof') { }

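mfhd (Movie Fragment Header Box) has a simple structure. sequence_number is the sequence number of the movie fragment; it starts at 1 and increases in the order in which the movie fragments are produced.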

aligned(8) class MovieFragmentHeaderBox extends FullBox('mfhd', 0, 0) {
    unsigned int(32) sequence_number;
}

traf (Track Fragment Box)

aligned(8) class TrackFragmentBox extends Box('traf') { }

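In fMP4 the data is divided into multiple movie fragments. A movie fragment can contain multiple track fragments (zero or more per track), and each track fragment can contain multiple samples of its track.

Each track fragment contains multiple track runs, and each track run represents a contiguous group of samples.

tfhd (Track Fragment Header Box) sets default values for the metadata of the samples in a track fragment.

The pseudocode is as follows. Every field except track_ID is optional.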

aligned(8) class TrackFragmentHeaderBox extends FullBox('tfhd', 0, tf_flags) {
    unsigned int(32) track_ID;
    // all the following are optional fields
    unsigned int(64) base_data_offset;
    unsigned int(32) sample_description_index;
    unsigned int(32) default_sample_duration;
    unsigned int(32) default_sample_size;
    unsigned int(32) default_sample_flags;
}

sample_description_index, default_sample_duration, and default_sample_size need little explanation, so only tf_flags and base_data_offset are covered here.

First tf_flags. The flag values are as follows (again combined with bitwise OR):

0x000001 base-data-offset-present: the base_data_offset field is present; it is the base offset of the data relative to the whole file;
0x000002 sample-description-index-present: the sample_description_index field is present;
0x000008 default-sample-duration-present: the default_sample_duration field is present;
0x000010 default-sample-size-present: the default_sample_size field is present;
0x000020 default-sample-flags-present: the default_sample_flags field is present;
0x010000 duration-is-empty: there are no samples for this time interval; default_sample_duration, if present, is 0;
0x020000 default-base-is-moof: ignored when base-data-offset-present is 1. When base-data-offset-present is 0, the base_data_offset of the current track fragment is counted from the first byte of the moof;

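The position of a sample is computed as base_data_offset + data_offset, where data_offset is given once per track run in trun and the samples of a run follow one another from that position. If base_data_offset is not given explicitly, sample positions are usually relative to the moof.

For example, tf_flags equal to 57 (0x000039 = 0x000001 | 0x000008 | 0x000010 | 0x000020) means that base_data_offset, default_sample_duration, default_sample_size, and default_sample_flags are present.

In this example base_data_offset is 1263 (the sizes of ftyp and moov add up to 1263).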
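The way tf_flags decides which optional fields follow track_ID can be sketched in Python as below. parse_tfhd and the lookup tables are names made up for this sketch. The payload reuses tf_flags = 57 and base_data_offset = 1263 from the example; the remaining field values are invented for illustration.

```python
import struct

TF_FLAG_NAMES = {
    0x000001: "base-data-offset-present",
    0x000002: "sample-description-index-present",
    0x000008: "default-sample-duration-present",
    0x000010: "default-sample-size-present",
    0x000020: "default-sample-flags-present",
    0x010000: "duration-is-empty",
    0x020000: "default-base-is-moof",
}

# Optional fields in the order they appear after track_ID.
OPTIONAL_FIELDS = (
    (0x000001, "base_data_offset", ">Q"),
    (0x000002, "sample_description_index", ">I"),
    (0x000008, "default_sample_duration", ">I"),
    (0x000010, "default_sample_size", ">I"),
    (0x000020, "default_sample_flags", ">I"),
)


def parse_tfhd(payload: bytes) -> dict:
    """Parse a tfhd payload: version(1) + tf_flags(3) + track_ID + optional fields."""
    tf_flags = int.from_bytes(payload[1:4], "big")
    (track_id,) = struct.unpack_from(">I", payload, 4)
    out = {
        "tf_flags": tf_flags,
        "flags_set": [name for bit, name in TF_FLAG_NAMES.items() if tf_flags & bit],
        "track_ID": track_id,
    }
    pos = 8
    for bit, name, fmt in OPTIONAL_FIELDS:
        if tf_flags & bit:  # the field exists only when its flag is set
            (out[name],) = struct.unpack_from(fmt, payload, pos)
            pos += struct.calcsize(fmt)
    return out


# tf_flags = 57: base_data_offset, default duration, size and flags are present.
tfhd_payload = (bytes([0]) + (57).to_bytes(3, "big")
                + struct.pack(">IQIII", 1, 1263, 1024, 0, 0x01010000))
tfhd = parse_tfhd(tfhd_payload)
print(tfhd)
```

Because the optional fields are packed back to back, a reader cannot locate any of them without first consulting tf_flags.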

The trun (Track Run Box) pseudocode is as follows:

aligned(8) class TrackRunBox extends FullBox('trun', version, tr_flags) {
    unsigned int(32) sample_count;
    // the following are optional fields
    signed int(32) data_offset;
    unsigned int(32) first_sample_flags;
    // all fields in the following array are optional
    {
        unsigned int(32) sample_duration;
        unsigned int(32) sample_size;
        unsigned int(32) sample_flags;
        if (version == 0)
            { unsigned int(32) sample_composition_time_offset; }
        else
            { signed int(32) sample_composition_time_offset; }
    }[ sample_count ]
}

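As mentioned earlier, a track run represents a contiguous group of samples, where:

sample_count: number of samples;
data_offset: offset of the data for this run;
first_sample_flags: optional; settings for the first sample in the current track run;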

tr_flags are as follows, along much the same lines:

0x000001 data-offset-present: the data_offset field is present;
0x000004 first-sample-flags-present: the first_sample_flags field is present. Its value overrides the flags of the first sample only, which makes it possible to mark the first frame of a group of samples as a key frame and the rest as non-key frames. When first_sample_flags is present, sample_flags is absent;
0x000100 sample-duration-present: each sample has its own sample_duration; otherwise the default is used;
0x000200 sample-size-present: each sample has its own sample_size; otherwise the default is used;
0x000400 sample-flags-present: each sample has its own sample_flags; otherwise the default is used;
0x000800 sample-composition-time-offsets-present: each sample has its own sample_composition_time_offset;

For example, with tr_flags equal to 2565 (0x000A05 = 0x000001 | 0x000004 | 0x000200 | 0x000800), data_offset, first_sample_flags, sample_size, and sample_composition_time_offset are present.
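A minimal Python sketch of reading a trun with these flags, and of turning it into sample positions, might look like this. parse_trun and sample_offsets are illustrative names. The payload uses tr_flags = 2565 as in the example, while the data_offset of 100, the two sample sizes, and the reuse of 1263 as the base offset are invented values.

```python
import struct
from typing import List


def parse_trun(payload: bytes) -> dict:
    """Parse a trun payload: version(1) + tr_flags(3) + sample_count + optional fields."""
    version = payload[0]
    tr_flags = int.from_bytes(payload[1:4], "big")
    (sample_count,) = struct.unpack_from(">I", payload, 4)
    out = {"version": version, "tr_flags": tr_flags, "sample_count": sample_count}
    pos = 8
    if tr_flags & 0x000001:
        (out["data_offset"],) = struct.unpack_from(">i", payload, pos)
        pos += 4
    if tr_flags & 0x000004:
        (out["first_sample_flags"],) = struct.unpack_from(">I", payload, pos)
        pos += 4
    # The composition time offset is unsigned in version 0, signed otherwise.
    cto_fmt = ">I" if version == 0 else ">i"
    per_sample = (
        (0x000100, "sample_duration", ">I"),
        (0x000200, "sample_size", ">I"),
        (0x000400, "sample_flags", ">I"),
        (0x000800, "sample_composition_time_offset", cto_fmt),
    )
    samples = []
    for _ in range(sample_count):
        sample = {}
        for bit, name, fmt in per_sample:
            if tr_flags & bit:
                (sample[name],) = struct.unpack_from(fmt, payload, pos)
                pos += 4
        samples.append(sample)
    out["samples"] = samples
    return out


def sample_offsets(base_data_offset: int, run: dict, default_size: int = 0) -> List[int]:
    """Position of each sample: base_data_offset + data_offset, then back to back."""
    pos = base_data_offset + run.get("data_offset", 0)
    offsets = []
    for sample in run["samples"]:
        offsets.append(pos)
        pos += sample.get("sample_size", default_size)
    return offsets


# tr_flags = 2565: data_offset, first_sample_flags, sample_size and
# sample_composition_time_offset are present; two samples of 500 and 300 bytes.
trun_payload = (bytes([0]) + (2565).to_bytes(3, "big")
                + struct.pack(">IiI", 2, 100, 0x02000000)
                + struct.pack(">IIII", 500, 0, 300, 1024))
run = parse_trun(trun_payload)
print(run)
print(sample_offsets(1263, run))
```

When sample-size-present is not set, the sizes have to come from the defaults in tfhd or trex, which is what the default_size parameter stands in for here.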
