MuseCoco: Generating Symbolic Music from Text

Abstract

Generating music from text descriptions is a user-friendly mode of interaction, since text is a relatively natural interface for user engagement. While some approaches use text to control music audio generation, editing musical elements in generated audio is challenging for users. In contrast, symbolic music offers ease of editing, making it more accessible for users to manipulate specific musical elements. In this paper, we propose MuseCoco, which generates symbolic music from text descriptions, using musical attributes as the bridge to break the task down into text-to-attribute understanding and attribute-to-music generation stages. MuseCoco stands for Music Composition Copilot: it empowers musicians to generate music directly from given text descriptions, offering a significant improvement in efficiency compared to creating music entirely from scratch. The system has two main advantages. First, it is data-efficient: in the attribute-to-music generation stage, the attributes can be extracted directly from music sequences, making model training self-supervised, while in the text-to-attribute understanding stage, the text is synthesized and refined by ChatGPT based on the defined attribute templates. Second, the system achieves precise control over specific attributes in text descriptions and offers multiple control options through attribute-conditioned or text-conditioned approaches. MuseCoco outperforms baseline systems in musicality, controllability, and overall score by at least 1.27, 1.08, and 1.32, respectively, and objective control accuracy improves by about 20%. In addition, we have developed a robust large-scale model with 1.2 billion parameters, showcasing exceptional controllability and musicality.

Figure 1: The two-stage framework of MuseCoco. Text-to-attribute understanding extracts diverse musical attributes, based on which symbolic music is generated through the attribute-to-music generation stage.
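To make the two-stage framework concrete, the sketch below outlines how a text description could flow through the pipeline. All names here (MusicAttributes, text_to_attributes, attributes_to_music, generate) are hypothetical placeholders for illustration and are not the released MuseCoco code; the toy keyword matching merely stands in for the trained text-to-attribute model.

```python
# A minimal sketch of the two-stage pipeline, assuming a small set of musical
# attributes bridges the stages. All names are illustrative placeholders, not
# the released MuseCoco API.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MusicAttributes:
    """A small subset of the musical attributes bridging the two stages."""
    tempo: Optional[str] = None            # e.g. "moderato"
    time_signature: Optional[str] = None   # e.g. "4/4"
    key: Optional[str] = None              # e.g. "major"
    instruments: List[str] = field(default_factory=list)


def text_to_attributes(description: str) -> MusicAttributes:
    """Stage 1: text-to-attribute understanding.

    In the paper this is a model trained on text synthesized from attribute
    templates and refined by ChatGPT; a toy keyword check stands in here."""
    attrs = MusicAttributes()
    text = description.lower()
    if "4/4" in text:
        attrs.time_signature = "4/4"
    if "major key" in text:
        attrs.key = "major"
    if "piano" in text:
        attrs.instruments.append("piano")
    return attrs


def attributes_to_music(attrs: MusicAttributes) -> List[int]:
    """Stage 2: attribute-to-music generation.

    In the paper, attributes extracted directly from music sequences (so the
    training is self-supervised) condition a large autoregressive model over
    symbolic-music tokens; this stub returns an empty token sequence."""
    return []


def generate(description: str) -> List[int]:
    # The two stages are chained: text -> attributes -> symbolic-music tokens.
    return attributes_to_music(text_to_attributes(description))
```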

Please note: this work focuses on generating symbolic music, which does not contain timbre information. We use MuseScore (https://musescore.org/en) to export mp3 files for reference; musicians are encouraged to use the .mid files for further timbre refinement.

Comparison with Baselines

Group 1

Text Descriptions: The combination of moderate speed, balanced rhythm, and major key, along with its 4/4 time signature, makes this song a prime representation of the classical style. This music would be best suited for a romantic and elegant sunset on the beach. The song structure is made up of about 15 bars. The music is influenced by Bach. The music is given its sound through piano. The use of a specific pitch range of 3 octaves creates a cohesive and unified sound throughout the musical piece. The track lasts for about 40 seconds.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated sample from BART demonstrates inadequate control over attributes such as time signature, duration, and octaves. Furthermore, the sample's musicality suffers due to a lack of harmonies, resulting in an incomplete composition.

GPT-4: The generated music from GPT-4 is controlled well. However, it tends to sound monotonous since the rhythm patterns lack variation, which is common in most music generated by GPT-4. Furthermore, some counterpoints in the composition, specifically in the 4th and 10th bars, lack well-designed integration, detracting from the overall quality of the piece.

MuseCoco: The generated music from MuseCoco is controlled precisely. In addition, it has more complex harmonies and more diverse rhythm patterns, making for a better listening experience.


Group 2

Text Descriptions: This song has a moderate tempo and its setting would best be described as a peaceful forest with a waterfall. Major key adds a unique flavor to this music. The music should feature grand piano, drum and bass. This song is about 40 seconds long.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music solely relies on the piano, which deviates from the consistent instrumentation described in the text. In terms of musicality, the rhythm lacks fluency, with many unexpected rests that disrupt the listening experience.

GPT-4: While the generated music exhibits precise control, the overall musicality falls short of expectations. The drum pattern deviates from the usual rhythmic sensibility, failing to provide a satisfying rhythmic foundation. Additionally, the note pitches used for the bass are not suitable, resulting in a lack of a cohesive bass line, which is crucial for a complete and well-rounded composition.

MuseCoco: The generated music demonstrates precise control. The piano effectively carries the melody, while the bass provides a solid foundation with its bass line, enhancing the depth of the composition. The addition of the drums in the middle adds various rhythmic patterns, further enriching the overall musical texture.


Group 3

Text Descriptions: 2/4 is the meter of the music and should feature grand piano. The music is composed of a dense and intricate arrangement of notes, resulting in a captivating and memorable experience with the choice of minor key. The song embodies the characteristics of classical style. 15 seconds is a relatively long time for a song, but the dynamic and diverse pitch range spanning 4 octaves makes the music an interesting and engaging listening experience.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music suffers from inaccuracies in its control of time signature, duration, octaves, and note density. Furthermore, the melody line lacks variation, resulting in a monotonous composition with repetitive rhythm patterns and harmonic progressions.

GPT-4: The generated music lacks precision in controlling duration, octaves, and note density. As a result, the composition suffers from a monotonous melody characterized by simplistic rhythms and a lack of harmonies.

MuseCoco: The generated music demonstrates precise control, particularly notable in the dense arrangement of notes. This intricate composition style contributes to its memorability and adds an interesting and captivating quality to the music.


Group 4

Text Descriptions: The moderate tempo and conventional classical roots of this song, along with its 16 ~ 30 second running time and use of major key, creates a rich and dynamic sonic palette with a limited pitch range of 5 octaves that allows for a greater emphasis on the nuances of tone and phrasing. The music is in 3/4. The use of piano is vital to the music. The music follows in Chopin's footsteps.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music deviates from the attributes specified in the text description, such as time signature, duration, and octaves. Furthermore, the style of the music does not resemble that of Chopin. Additionally, the abnormal usage of rests within each voice part raises doubts about the composition's coherence and musicality.

GPT-4: The generated music falls short in multiple aspects. It does not span 5 octaves as specified, and it fails to capture the essence of Chopin's style. The melody lacks variation, resulting from limited rhythm patterns and harmonies. Additionally, the inclusion of two identical tracks introduces redundancy and detracts from the overall composition.

MuseCoco: The generated music demonstrates precise control. Specifically, the 5th and 6th bars skillfully capture the distinct essence and emotive qualities associated with Chopin's style, effectively conveying the desired musical expression.


Group 5

Text Descriptions: The music is slow and contemplative, with a pitch range of 4 octaves adding a distinctive character and emphasizing its emotional depth. The song's length is around about 14 bars, and it is not dance-friendly due to its style. The minor key adds a unique flavor, and grand piano, bass, voice, flute and drum should be included in the music. The time signature of the music is 3/4. The song is in an intense tempo with a moderate pace of 16 ~ 30-seconds. The music is a quintessential example of religious genre.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music is monophonic piano only, lacking the bass, voice, flute, and drum, and it uses the wrong time signature of 6/8.

GPT-4: The generated music has an incorrect pitch range (only 1 octave) and tempo (very slow), and its melody is monotonous and repetitive.

MuseCoco: The generated music from MuseCoco is controlled precisely. The moderato chanting presents a peaceful religious style.


Group 6

Text Descriptions: The song's fast-paced rhythm, unique pitch range of 4 octaves, and use of grand piano create a memorable and energetic listening experience, while the 4/4 and minor give it an emotional quality that is similar to Brahms's style. This song is a true representation of the classical genre. The running time of the song is 16 ~ 30 seconds.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music lasts too long, features only a monophonic melody, and does not resemble the style of Brahms.

GPT-4: The tempo of the music generated by GPT-4 is too slow, and the pitch range spans less than one octave, neither of which corresponds to the text. The melody is also too simple to evoke Brahms.

MuseCoco: The generated music demonstrates precise control. Specifically, it is in 4/4 time, in a minor key, and at a fast tempo. The development of the multi-layered piano writing reflects Brahms's style.


Group 7

Text Descriptions: The moderate tempo and conventional classical roots of this song, along with its about 20 second running time and use of major key, creates a rich and dynamic sonic palette with a limited pitch range of 4 octaves that allows for a greater emphasis on the nuances of tone and phrasing. The concert is a true celebration of Schubert's legacy. The concert hall is packed with fans eagerly awaiting the performance. The stage is set with piano, and the beat of this song is extremely strong.

BART GPT-4 MuseCoco

Comments:

BART-base: The generated music is too long to correspond with the text. It consists of disconnected notes and a dull melody, which cannot express the classical style of Schubert.

GPT-4: The pitch range of the generated music is limited to only one octave. Its simple melody lacks the classical style of Schubert.

MuseCoco: The generated music is controlled precisely. It uses the piano at a moderate tempo in a major key, presenting a relaxed 3/4 feel reminiscent of Schubert's minuets.

Generation Diversity

We test generation diversity by generating multiple samples conditioned on the same text description.
Sample 1

Text Description:

Music is representative of the typical pop sound and spans 13 ~ 16 bars, this is a song that has a bright feeling from the beginning to the end. This song has a very fast and lively rhythm. The use of a specific pitch range of 3 octaves creates a cohesive and unified sound throughout the musical piece.

Generated Samples:




Sample 2

Text Description:

3/4 is the time signature of the music. This song is unmistakably classical in style. This music is low-tempo. The use of piano is vital to the music.

Generated Samples:




Sample 3

Text Description:

The flute creates a lively and upbeat atmosphere in this fast-paced song with just the right tempo, making it perfect for a tropical setting. The time signature of this song is not usual. The music is enriched by oboe, trumpet, flute and tuba.

Generated Samples:




Sample 4

Text Description:

The song's upbeat melody and swift pace, paired with its 3 octave pitch range and 3/4 time signature, offers a diverse and dynamic listening experience that suits a bustling cityscape. Strings, drum, clarinet and flute are utilized in the musical performance. The song's length is around about 14 bars.

Generated Samples:




Sample 5

Text Description:

This song has a steady and moderate rhythm. The music is given its sound through saxophone, drum and trombone. You can hear about 14 bars in this song. 2/4 is the time signature of the music.

Generated Samples:



Controlled by Specific Attributes

The model can also take specific attribute values, rather than text descriptions, as conditions for generating the corresponding music; a rough sketch of this attribute-conditioned input follows the table below.
Attribute Values | Generated Samples

Set 1:
Rhythm Intensity: moderate
Bar: 16
Time Signature: 4/4
Tempo: moderato
Time: 30-45s
Instrument: violin

Set 2:
Rhythm Intensity: intense
Tempo: slow
Time: >60s
Artist Style: Bach

Set 3:
Instrument: piano, drum, bass
Tempo: moderato
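As a rough illustration of how such attribute values could condition generation, the sketch below serializes them into a prefix of control tokens that a generation model would then continue with symbolic-music tokens. The "<Key:Value>" token format and the attributes_to_prefix function are assumptions for illustration, not the actual MuseCoco vocabulary or API.

```python
# Hypothetical serialization of attribute values into a conditioning prefix.
# The "<Key:Value>" token format is an assumption for illustration only.

def attributes_to_prefix(attrs: dict) -> list:
    """Turn attribute key/value pairs into control tokens that would precede
    the symbolic-music tokens sampled by an attribute-conditioned model."""
    return [f"<{key}:{value}>" for key, value in attrs.items()]

prefix = attributes_to_prefix({
    "Rhythm Intensity": "moderate",
    "Bar": 16,
    "Time Signature": "4/4",
    "Tempo": "moderato",
    "Time": "30-45s",
    "Instrument": "violin",
})
print(prefix)
# ['<Rhythm Intensity:moderate>', '<Bar:16>', '<Time Signature:4/4>',
#  '<Tempo:moderato>', '<Time:30-45s>', '<Instrument:violin>']
```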

Description Length

This section demonstrates the model's ability to handle different description lengths.
Short Descriptions

Text Descriptions: This song with a moderate beat is characterized by quiet and has a duration of about 14 bars.

Text Descriptions: This 31 ~ 45-second song embodies the essence of vocal music.

Text Descriptions: The song's sound is steeped in the conventions of religious style, and its pitch range is within 5 octaves. The tempo of this song is moderate.


Long Descriptions

Text Descriptions: The music in this song is defined by its distinctive character, which is emphasized by its pitch range of 6 octaves, adding emotional depth. Furthermore, the major key adds a unique flavor to the music, while its powerful and driving beat contributes to its overall impact. The arrangement of the song omits the use of oboe and organ, and it is written in 4/4, which gives it a balanced beat. The music evokes a strong sense of gusto and lasts for 16 ~ 30 seconds, with a duration of 9 ~ 12 bars. The musical performance employs grand piano and strings.

Text Descriptions: The compact pitch range of 4 octaves results in a focused and impactful musical performance, while the choice of major key adds to the captivating and memorable experience. With a duration of about 40 seconds, the rapid tempo of this song combined with the distinct sound of grand piano, drum, guitar and bass creates its unique musical identity. The meter of the music follows 4/4, and it is played at a moderate tempo, further enhancing its overall effect. The song's style is defined by its religious influences, culminating in a compelling musical composition. The music consists of about 15 bars.